

Objects of interest and necessity

With the notion of an ‘object of interest’ a guided tour of a place, a museum or collection likely comes to mind. One may easily read this compilation of texts as a catalogue for such a tour in a social and technical system, where we stop and wonder about the different objects that, in one way or another, take part in the generation of images by artificial intelligence (AI).

'A guided tour' perhaps also limits the understanding of what objects of interest are. In science, for instance, an object of interest sometimes refers to what one might call the potentiality of an object. Take, for instance, the famous Kepler telescope, whose mission was to search the Milky Way for exoplanets (planets outside our own solar system). Among all the stars it observed, some are candidates, so-called Kepler Objects of Interest (KOI), that are documented, indexed and catalogued. In similar ways, this catalogue is the outcome of an investigative process where we – by trying out different software, reading documentation and research, looking into communities of practice that experiment with AI image creation, and more – have sought to understand the things that make generative AI images possible; that is, the underlying dependencies on relations between communities, companies, models, technical units, and more in AI image creation. Within this system there is not just a functional apparatus, but also an ‘imaginary’; that is, there are underlying expectations and norms (for planetary existence, for instance) that are met in specific objects.

The catalogue, however, is strictly speaking not scientific, and should not be taken too seriously as such. It is not as if there is a set of defined parameters by which we have prioritized some objects over others. One can also think of an object of interest in a different way; as something that is not just the manifestation of an imaginary, but also what produces it. Take for instance Orhan Pamuk’s famous Museum of Innocence. It is a book by the Nobel Prize-winning author, but also an entry ticket to a really existing museum in Istanbul, where one can find, amongst other items, a showcase of 4,213 cigarette butts smoked by Füsun, the object of the main character Kemal’s love.

During my eight years of going to the Keskins’ for supper, I was able to squirrel away 4,213 of Füsun’s cigarette butts. Each one of these had touched her rosy lips and entered her mouth, some even touching her tongue and becoming moist, as I would discover when I put my finger on the filter soon after she had stubbed the cigarette out. The stubs, reddened by her lovely lipstick, bore the unique impress of her lips at some moment whose memory was laden with anguish or bliss, making these stubs artifacts of a singular intimacy.[1]

The collection of objects is what makes the story, and also what makes the story real. Objects contain an associative power that literally creates memories (the young Marcel in Proust’s In Search of Lost Time (À la recherche du temps perdu), whose taste of a madeleine sets the whole novel of remembrance in motion, is perhaps the most famous literary example of this). Therefore, this catalogue is not just a collection of the objects that make generative AI images, but also an exploration of an imaginary of AI image creation through the collection and exhibition of objects – and in particular, an imaginary of ‘autonomy’.

4,213 cigarette stubs, many with lipstick stains, displayed on needles with white background. Created using Draw Things with the diffusion model SDXL Base (v1.0).

Most people’s experiences with generative AI image creation come from flagship platforms like OpenAI’s DALL-E, Adobe Firefly, Microsoft Image Creator, Midjourney, or other proprietary services. There is a whole ecology of services that are distinct yet often based on the same underlying models or techniques of so-called ‘diffusion’. Nevertheless, there are also communities who for different reasons seek some kind of independence and autonomy from the mainstream platforms. It may be that they are unsatisfied with the stylistic outputs; say, interested not just in manga-style images, but in a particular manga style (the English-language image board and gallery Danbooru is an example of this, where much content is erotic). Others may have issues with the platform model itself, and how it compromises ideals of free/’libre’ and open-source software (aka F/LOSS). They want image generation to be more broadly available, free of cost, to use less processing power, or to be open to new technical ideas and experimentation. For instance, OpenAI is not as open as the name indicates. It has a complicated history where commercialisation, partnerships and dependencies on other tech corporations (like Microsoft) have become increasingly central to its operations. For these reasons, autonomy from commercial and proprietary platforms of AI often implies visions of alternative infrastructures – more 'peer-to-peer' and decentralised than the platforms' 'client-server' relations between software and users. The objects presented in this catalogue all refer to autonomous practices of AI image generation, with different versions of what autonomy means and various degrees of dependency; autonomy in these pages is not simply understood as the possibility of a self-imposed law but more broadly as an attempt to choose one's own attachments (or 'heteronomy'[2]). That is, rather than explaining how generative AI works (as many researchers and critics of AI call for[3]), our interest lies in opening up an understanding of what it takes to make AI image generation work, and to make it work separately from mainstream platforms and capital interests.

Our outset is ‘Stable Diffusion’, a generative AI model that produces images from text prompts. Characteristically, the company behind it (Stability AI) uses the same ‘diffusion’ technology as many of the commercial services, but with heavily reduced labour costs in producing the underlying dataset and the descriptions of the images necessary to train the model. Indeed, this form of annotation is usually carried out by precarious and cheap 'data workers' in the 'gig economy' who describe images according to set categories. Although the labour is cheap, the annotation of images as a whole is an enormous and expensive task. However, explained a bit technically, Christoph Schumann, an independent high school teacher, found a way to ‘scrape’ the internet for links to images with alt text (text that describes images for people with disabilities) and filter out the nonsensical text (using CLIP), thereby providing a fully annotated dataset for only 5,000 USD (we have also included a text on both our own and others' maps of this complex ecology in the catalogue; see also LAION). Against this background, Stable Diffusion has been released under a community license that allows for research, non-commercial, and also limited commercial use.[4] That is, users can freely install and use Stable Diffusion under conditions similar to much other F/LOSS software.
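The filtering step can be made concrete with a short sketch. This is not LAION's actual pipeline, only a minimal illustration of the principle, assuming the openly published CLIP weights (here "openai/clip-vit-base-patch32" via the transformers library) and a locally downloaded image with its scraped alt text: a pair is kept only if the image and text embeddings are similar enough, and the 0.28 threshold is an illustrative assumption.

```python
# Minimal sketch of CLIP-based filtering of scraped (image, alt-text) pairs.
# Not LAION's actual code; the model name and threshold are illustrative assumptions.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image_path: str, alt_text: str, threshold: float = 0.28) -> bool:
    """Return True if the alt text plausibly describes the image."""
    image = Image.open(image_path)
    inputs = processor(text=[alt_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cosine similarity between the image and text embeddings.
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    similarity = (img @ txt.T).item()
    return similarity >= threshold

# Example: pairs with boilerplate alt text ("image123.jpg") score low and are dropped.
# keep_pair("photo_of_a_cat.jpg", "a tabby cat sleeping on a sofa")
```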

Subsequently, there is a range of other F/LOSS software that provides user interfaces to Stable Diffusion, and a lively visual culture that uses and also builds on Stable Diffusion models. This includes, for instance, CivitAI, which allows users to share and download AI models and, for its members, to use its servers in exchange for a platform currency (virtual tokens called Buzz) and to show and sell their AI-generated images. DeviantArt is another platform that functions in similar ways. Or Hugging Face, which functions as a repository of user-created AI models that can be used in other F/LOSS applications (such as Draw Things or other interfaces to image generation and models) to generate images or ‘tweak’ the models, using so-called LoRAs. Some of these sites are heavily funded by venture capital. They are often at once communities of practice interested, in one way or another, in autonomy, and corporations geared towards maximizing value extraction. However, one also finds Stable Horde, which in a peer-to-peer fashion allows its community members to access each other’s machines for processing power – contrary to conventional AI platforms, where one depends on a corporate service – and which adopts an explicitly articulated approach to autonomy.

In other words, what autonomy is, and what it means to separate from capital interests, is by no means uniform – the range of agents, dependencies, flows of capital, and so on, can be difficult to comprehend and is in constant flux. This we have tried to capture in our description of the objects, guided by a set of questions that we address directly or indirectly in the different entries:

  1. What is the network that sustains this object?
  2. How does it evolve through time?
  3. How does it create value? Or decrease / affect value?
  4. What is its place/role in techno cultural strategies?
  5. How does it relate to autonomous infrastructure?

The murkiness of autonomous AI image generation implicates our interests in the objects, too. Therefore, the objects are not just of 'interest', but also of ‘necessity’. The Cuban artist and designer Ernesto Oroza speaks of “objects that are at the same time an understanding of a need and the answer to it.”[5] He points, for instance, to the Cuban phenomenon of S-net (short for Street Network), a community-driven form of wireless network that arose where people wanted to play online games or otherwise access the internet, but where internet access was limited and regulated by the government. S-net is autonomous and independent, and yet, in order to exist, it also accepts official demands such as not discussing politics online.[6]

If one asks what qualifies as autonomy in AI image generation, and the intent is to catalogue what autonomous AI image generation is and consists of, we answer by showcasing what it looks like, by necessity – because it always exists in relation to a fluctuating set of correspondences, conditions and dependencies. In much the same way, the catalogue of ‘Kepler Objects of Interest’ might reflect the potential of objects to be something, but what something is always also looks like something; like love that might look like 4,213 cigarette butts exhibited in the Museum of Innocence. In this sense, the catalogue is also always in flux, and with its unfinished nature, we invite others to continue editing it.

Further reading and writing (web/wiki-to-print)

The texts in this catalogue have all been written on a self-hosted wiki. From this, we have collated a PDF for print, using so-called 'web-to-print' techniques.
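As an illustration only (the exact toolchain is not specified here), a wiki page can be rendered to a print-ready PDF with an off-the-shelf web-to-print tool such as WeasyPrint; the page URL is taken from this catalogue, while the print stylesheet is a minimal placeholder.

```python
# Hedged web-to-print sketch: render a wiki page to a PDF with WeasyPrint.
# This is not necessarily the toolchain used for this catalogue.
from weasyprint import HTML, CSS

wiki_url = "https://ctp.cc.au.dk/w/index.php/Category:Objects_of_Interest_and_Necessity"
print_css = CSS(string="@page { size: A5; margin: 15mm; }")  # basic paged-media layout

HTML(wiki_url).write_pdf("oin_booklet.pdf", stylesheets=[print_css])
```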

Some entries are kept short, some are longer and more elaborate. The wiki and this catalogue will change over time, reflecting how the nature of 'objects of interest/necessity' in autonomous generative AI image creation is in flux, subject to abstraction and negotiation by us as well as others. This 'wiki-style' of collaborative knowledge production is intrinsic to our collaboration in the project SHAPE – Knowledge Servers, funded by Aarhus University,[7] and also to the events and workshops around autonomous AI image generation that we have organised with The Royal Danish Library and the community Code&Share in Aarhus, as part of the project.

On the wiki, we have included a 'Guestbook' where we invite visitors to leave comments and reflections.

https://ctp.cc.au.dk/w/index.php/Category:Objects_of_Interest_and_Necessity

––––– Christian Ulrik Andersen, Nicolas Malevé, Pablo Velasco

[1] Orhan Pamuk, The Museum of Innocence, trans. Maureen Freely (New York: Alfred A. Knopf, 2009), 501.

[2] Jacob Lund, “Autonomy,” Artistic Practice under Contemporary Conditions, December 3, 2024, https://contemporaryconditions.art/text/autonomy.

[3] Feiyu Xu et al., “Explainable AI: A Brief Survey on History, Research Areas, Approaches and Challenges,” in Natural Language Processing and Chinese Computing, ed. Feiyu Xu et al., Lecture Notes in Computer Science, vol. 11839 (Cham: Springer, 2019), 563–574, https://doi.org/10.1007/978-3-030-32236-6_51.

[4] Stability AI, “Stability AI License,” accessed August 11, 2025, https://stability.ai/license.

[5] Ernesto Oroza, “Technological Disobedience in Cuba,” Walker Art Center Magazine, accessed August 11, 2025, https://walkerart.org/magazine/ernesto-oroza-technological-disobedience-cuba/.

[6] Steffen Köhn, Island in the Net: Digital Culture in Post-Castro Cuba (Princeton University Press, 2026), https://press.princeton.edu/books/paperback/9780691273143/island-in-the-net.

[7] SHAPE, “Knowledge Servers,” accessed August 11, 2025, https://shape.au.dk/en/themes/knowledge/knowledge-servers.


CivitAI

Like Hugging Face, CivitAI is a collaborative hub for AI development. But unlike Hugging Face which supports a large range of applications, CivitAI is dedicated to image generation only. The platform attracts a huge number of enthusiasts and amateurs. In contrast to the rather serious interface and atmosphere of Hugging Face (that looks more like GitHub), the platform resembles art platforms such as DeviantArt where users display their portfolios. With its labyrinth of image galleries, it celebrates the capabilities of generative AI to clone every style and cross every genre from cartoon to oil painting and fashion photography to extreme pornography, as if the users were on a mission to exhaust pixel space.

But the platform attracts more than image makers. Many Civitans also upload custom-made models, LoRAs, VAEs, and highly detailed tutorials. A large population of anime fans is responsible for an endless list of models that specialize in a given manga character, as well as many versions of the infamous PONY models, which started as a project to generate better images of the characters of My Little Pony and have evolved into a complex constellation of models able to produce reliable body poses.[1] This makes CivitAI a bridge between fans and computer geeks (who are sometimes both) who enjoy the platform's very lax sense of moderation, which unfortunately does little to prevent various forms of abuse.[2] These models are made available on the platform's own image generator as well as for download. The largest share of models are available for free, therefore finding their way onto desktops for private use and into peer-to-peer networks such as Stable Horde for communal production.

CivitAI has a large infrastructure at its disposal. As users train models, LoRAs and VAEs on the platform and generate impressive amounts of images, CivitAI needs capital investment. As a centralized service (in contrast to Stable Horde), it supports its operations through various commercial offers. Additionally, it raises staggering venture capital investment. In 2023, the company raised $5.1 million backed by the firm Andreessen Horowitz (a16z).[3] All in all, the company exemplifies the tensions and paradoxes of an autonomous AI and its attachments. It does indeed serve the bottom-up production of models and add-ons, as well as the 'democratization' of AI technology, in a way that goes beyond mere consumer usage. But it does so by converting the labour of love of a large population of enthusiasts into capital. On the one hand, it makes possible a relative delinking from the dominant players of the market (such as OpenAI) and nourishes an ecosystem of small actors from amateurs to hackers. On the other, it does so on the condition of capital accumulation and complicity with the dark matter of American finance.


[1] PurpleSmartAI. “Pony Diffusion V6 XL.” CivitAI, March 6, 2025. https://civitai.com/models/257749/pony-diffusion-v6-xl.

[2] Wei, Yiluo, Yiming Zhu, Pan Hui, and Gareth Tyson. “Exploring the Use of Abusive Generative AI Models on Civitai.” 2024. https://arxiv.org/abs/2407.12876.

[3] Perez, Sarah. “Andreessen Horowitz Backs Civitai, a Generative AI Content Marketplace with Millions of Users.” TechCrunch, November 14, 2023. https://techcrunch.com/2023/11/14/andreessen-horowitz-backs-civitai-a-generative-ai-content-marketplace-with-millions-of-users/.

Clip

Like the variational autoencoder (VAE), the vision model CLIP (contrastive language-image pre-training) is largely unknown to the general public. Like the VAE, it is used in the image generation pipeline as a component that encodes input into embeddings – statistical representations that can be operated upon in the latent space.[1]

The presence of CLIP in the pipeline illustrates the complexity of the relations between the various ecosystems of image generation. CLIP was first released in 2021 by OpenAI under an open source license,[2] just before the company changed its politics of openness. Subsequent products such as DALL-E are governed by a proprietary license.[3] CLIP is in its own right a foundational model and serves multiple purposes such as image retrieval and classification. Its use as a secondary component in the image generation pipeline shows the composite nature of these architectures where existing elements are borrowed from different sources and repurposed according to needs. If technically, CLIP bridges prompts and the latent space, politically it travels between proprietary and open source ecosystems.

Comparing CLIP to the VAE also shows how elements that perform similar technical functions allow for strikingly different forms of social appropriation. Amateurs train and retrain VAEs to improve image realism, whereas CLIP, which has been trained on four hundred million image-text pairs,[4] cannot be retrained without incurring exorbitant costs. The presence of CLIP in the pipeline is therefore largely due to its open licensing. The sheer cost of its production makes it a black box even for advanced users, and puts its inspection and customization out of reach.
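To make CLIP's role more tangible, the sketch below scores an image against a few candidate descriptions using the open-sourced package cited below (https://github.com/openai/CLIP); the image path and labels are illustrative. Note that this is inference only – precisely the kind of use that CLIP's open licensing permits, in contrast to retraining.

```python
# Minimal zero-shot classification sketch using OpenAI's open-sourced CLIP package.
# Labels and image path are illustrative assumptions.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
labels = ["a manga drawing", "an oil painting", "a fashion photograph"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # CLIP scores how well each caption matches the image in a shared embedding space.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.2f}")
```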

[1] Fabian Offert, “On the Concept of History (in Foundation Models),” IMAGE. Zeitschrift für interdisziplinäre Bildwissenschaft 19, no. 1 (2023): 121–134, http://dx.doi.org/10.25969/mediarep/22316.

[2] OpenAI, “CLIP (Contrastive Language-Image Pretraining): Predict the Most Relevant Text Snippet Given an Image,” GitHub, accessed August 22, 2025, https://github.com/openai/CLIP.

[3] Xiang, Chloe. “OpenAI Is Now Everything It Promised Not to Be: Corporate, Closed-Source, and For-Profit.” Vice, February 28, 2023. https://www.vice.com/en/article/openai-is-now-everything-it-promised-not-to-be-corporate-closed-source-and-for-profit/.

[4] Nicolas Malevé and Katrina Sluis, “The Photographic Pipeline of Machine Vision; or, Machine Vision’s Latent Photographic Theory,” Critical AI 1, no. 1–2 (October 2023), https://doi.org/10.1215/2834703X-10734066.

Currencies

₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ₺ ₻ ₼ ₽ ₾ ₿

Bozie, a DC elephant, and Swarna, a recent arrival from Calgary, get to know each other (photo by Gabrielle Emanuel).[1]

In some countries, zoos' codes of ethical conduct do not allow them to sell or buy animals to or from other zoos. Arguably, animals are an object of necessity in this context; however, zoos are institutions that avoid poachers and animal hunters, which historically were the main sources of wild animals. Zoos and aquariums around the world do not, in practice, think of animals as commodities (i.e., objects that can be bought with money).[1] Animals are not for sale, yet zoos need to acquire new animals. This apparent conundrum is usually solved through barter (exchanging one 'product' for another instead of using money). Again, however, this is a burdensome system, as it is not always the case that an elephant is readily available to be traded for 20 jellyfish. The association of zoos and aquariums therefore works through a system where a zoo donates an animal and gets nothing in return. The animals are sent to one of the zoos with space, need and facilities. But while there is no money in this transaction, the donors gain recognition within the zoo network, thus enhancing the possibilities of being the recipients of future donations.

Currencies take odd shapes. The zoo example above is one of many tales of the 'semantic volatility' associated with currencies. While we use fiat money (euros, crowns, etc.) for most of our daily lives, social structures work on a sometimes invisible and highly complex mesh of systems of value and exchange. In the example above, giving up an elephant does not, in reality, equal zero gain, but what is gained is less quantifiable than a sum of money. Within our objects of interest and necessity, these unusual, implicit, and explicit exchanges take the form of time, expertise, platformisation, and graphical processing power.

What is the network that sustains this object?

Most currencies are based on a network of interest that agrees to assign value to an object or system. In the case of zoos, the network is the association of zoos and aquariums that has agreed to assign value to donations and to ignore metrics or quantification (e.g., an elephant is neither an amount in euros nor equivalent to 50 lemurs). Plenty of alternative currencies rely on their own agreed systems and work on the basis of trust (either in the system or in the social structure).

Ad hoc currencies are common on digital platforms; this means they can be used only within a certain ecosystem. For example, in-game currencies like 'Gold' in the popular mobile game Clash of Clans can be earned and used only in that game. They allow for in-app or on-site monetisation, as they are commonly bought using legal tender (e.g., US dollars, Danish crowns). Plenty of digital systems develop their own in-app currencies, either tied or untied to legal tender. The platforms in these cases usually dictate the rules of exchange, and the users (consumers and/or producers of content) generate and share the currencies among themselves or through other objects of value.

Screenshot of Civitai's "beggar's board" (22 April 2025)

On an organisational plane, within AI-oriented platforms, currencies tie together the user, the producer, and the platform. "Buzz", the currency of CivitAI (one of the largest marketplaces for generative AI content), acts as a reward for the user's interaction with content, as a tip for content creators, and even as a 'bounty' for specific requests. The currency is controlled by the platform: the rules of its production and of legitimate exchange are defined by CivitAI. As such, this currency can be purchased using fiat money directly on the platform, but it can also be earned or awarded in different ways: reactions to content provide some Buzz; if one's model is added to a collection, Buzz is also generated for the owner of the model; the currency can be freely tipped; some specific bounties or rewards offer Buzz for creating a very specific model or LoRA (for example, to remove watermarks[2]); or one can even beg for currency.

The options listed above are a peek into a microcosm of social arrangements on a very specific platform of AI image generation. Buzz allows any CivitAI user to generate images. That is, this currency is exchanged for computational power, expertise, or a combination of both. Legal tender is transformed into community value, where GPU ownership and modelling knowledge and skills become highly valuable.

How does it create value? Or decrease / affect value?

Protocol as currency

Argo Blockchain's Mirabel Bitcoin mining facility (2018) (photo by Mike Bogosian; image from https://en.wikipedia.org/wiki/Bitcoin).[3]

During the late 2000s, the birth of cryptocurrencies sparked the imagination about how money could be different in the 21st century. The creation of Bitcoin, the first cryptocurrency, opened the door for a type of currency that was, arguably, defined more by its system than by its users. Instead of relying on a central, trusted institution, like a bank or a government, Bitcoin offloaded trust and accountability to a mathematically governed distributed system. Technically, the system would guarantee accurate transactions between any parties, without any central management. Some rules were attached to its code – for example, programmed scarcity – but no traditional financial organisation was involved in the creation of the currency's rules.

The centrality of code and protocol in this new type of digital currency did more than bring software to the main stage. Due to the high computational requirements of the blockchain design (the technology behind most cryptocurrencies), crypto miners (the computers that generate new coins) started requiring GPUs to be profitable. The equation was simple: the more computing power, the better the chances to 'find' a coin (i.e., to solve a mathematical puzzle and generate a valid new block on the chain). While CPUs were able to process the required computation, GPU architectures simply made this process faster. The crypto industry has thus built massive mining facilities with thousands of GPUs to profit by generating new coins, also producing a scarcity of this type of hardware. The unexpected relationship between digital currencies and the need for fast processing power suddenly made GPUs an important actor in the currencies landscape.

GPU as currency

Screenshot from macrotrends.net showing Nvidia market cap from 2020 to June 2025.[4]

With the expansion of LLMs and AI-oriented platforms, scarcity has moved once again towards hardware capable of training, operating, and fine-tuning LLMs. The boom of LLMs in the last five years started a race to develop and bring to market the most advanced models. Tech giants like Microsoft, OpenAI, and Meta compete by offering state-of-the-art models and integrating them into their software. That has made the GPU a holy grail of hardware, and has had a strong effect on the manufacturers: Nvidia, the company that produces the most popular GPUs for training and gaming, was valued at US$3 trillion in 2024, and surpassed the $4 trillion mark in 2025. A whole economy has grown up around the production of hardware for text and image generation.

In the CivitAI example, Buzz is also closely tied to access to a GPU. Much as with cryptocurrencies, the ownership of this type of hardware allows one to exchange computing power for currency. However, the economies of LLMs are not restricted to Big Tech and platform-driven lives. At a different place within this spectrum, the AI Horde network (see the GPU and Stable Horde entries) acts as a barter system of sorts, with its own currency. Named 'Kudos', it can be earned by sharing one's GPU in the network – that is, by lending a graphics device to produce images for someone within the network. Kudos can then be spent by using others' GPU cards (perhaps better ones, with access to more demanding diffusion models), through any interface connected to the network, and/or by gaining priority in the generation queue. Kudos, in this sense, value reciprocity and an imaginary of infrastructural autonomy outside of Big Tech's LLM offers.

We share our GPU with the AI Horde, allowing requests from other users to use our processing power. We thereby not only earn and spend Kudos but, most importantly, participate in economies of sharing. Even though the AI Horde network is not a tight community (it is, in fact, a network of individual GPU users), it allows us to think about currency in terms of materiality and reciprocity, and offers an insight into the possibilities of autonomy in an LLM-saturated context.

[1] NPR, “Episode 566: The Zoo Economy,” Planet Money, September 5, 2014, https://www.npr.org/sections/money/2014/09/05/346105063/episode-566-the-zoo-economy.

[2] Civitai Community, “ADetailer Model to Remove Watermarks from SDXL Models,” Civitai, accessed August 12, 2025, https://civitai.com/bounties/1168/adetailer-model-to-remove-watermarks-from-sdxl-models.

[3] Wikipedia, s.v. “Bitcoin,” last modified August 12, 2025, https://en.wikipedia.org/wiki/Bitcoin.

[4] Macrotrends, “NVIDIA Market Cap 2010–2025 | NVDA,” Macrotrends, accessed August 12, 2025, https://www.macrotrends.net/stocks/charts/NVDA/nvidia/market-cap.

Dataset

In the context of AI image generation, a dataset is a collection of image-text pairs (and sometimes other attributes such as provenance or an aesthetic score) used to train AI models. It is an object of necessity par excellence. Without a dataset, no model could see the light of day. Iconic datasets include the LAION aesthetic dataset, Artemis, ImageNet, and Common Objects in Context (COCO). These collections of images, mostly sourced from the internet, reach dizzying scales. ImageNet became famous for its 14 million images in the first decade of the century.[1] Today, LAION-5B consists of 5.85 billion CLIP-filtered image-text pairs.[2]

If large models such as Stable Diffusion require large-scale datasets, various components such as LoRAs, VAEs, refiners, or upscalers can be trained with a much smaller amount of data. In practice, this means that for each of these components, a custom dataset is created. As each of these datasets reflects a particular aspect of visual culture, the components trained on them function as conduits for imaginaries and world views. Image generators are not simply produced through mathematics and statistics; they are programmed by images. Programming by images is a specific curatorial practice that involves a wide range of skills, including a deep knowledge of the relevant visual domain, the ability to find the best exemplars, many practical skills such as scraping, image filtering, cleaning and cropping, and mastering the art of coherent classification and annotation. In our tour, we discuss two examples of curatorial practices of different scales and purposes: the creation of the LAION dataset and the art of collecting the images that are necessary to "bake the LoRA cake."[3]
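To give a sense of the practical side of this curation, here is a minimal, hypothetical sketch of preparing a small custom dataset for LoRA training: images are cropped to a uniform size and each is paired with a caption file. The folder layout (image plus a same-name .txt caption) follows a common community convention rather than any single tutorial, and the file names, size and captions are invented for illustration.

```python
# Sketch of preparing a tiny LoRA training dataset: crop/resize images and
# write one caption file per image. Paths, size, and captions are illustrative.
from pathlib import Path
from PIL import Image, ImageOps

src = Path("raw_images")     # scraped or hand-collected images
dst = Path("lora_dataset")   # what the training script will read
dst.mkdir(exist_ok=True)

captions = {
    "cup_01.jpg": "a ceramic cup on a wooden table, soft light",
    "cup_02.jpg": "a ceramic cup held in two hands, close-up",
}

for name, caption in captions.items():
    image = Image.open(src / name).convert("RGB")
    image = ImageOps.fit(image, (512, 512))           # centre-crop and resize
    out_name = Path(name).with_suffix(".png")
    image.save(dst / out_name)
    (dst / out_name.with_suffix(".txt")).write_text(caption)  # per-image annotation
```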

Further, behind each dataset there is an organisation - of people, corporate organisations, researchers, or others.[4] Even for individual users, collecting and sharing a dataset often means accepting and cultivating attachments to platforms. For instance, many datasets manually assembled by individuals are made freely available on platforms like Hugging Face, along with the large scale ones published by companies or universities, for others to build LoRAs or in other ways experiment with.


[1] Deng, Jia, Wei Dong, Richard Socher, Li-jia Li, Kai Li, and Li Fei-fei. “Imagenet: A Large-Scale Hierarchical Image Database.” CVPR 1 (2009): 248–55. https://doi.org/10.1109/CVPR.2009.5206848.

[2] Beaumont, Romain. “LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets.” LAION, March 31, 2022. https://laion.ai/blog/laion-5b/.

[3] knxo, “Making a LoRA Is Like Baking a Cake,” Civitai, published July 10, 2024, accessed August 18, 2025, https://civitai.com/articles/138/making-a-lora-is-like-baking-a-cake.

[4] JinsNotes. “Vision Dataset.” JinsNotes, August 1, 2024. Accessed August 26, 2025. https://jinsnotes.com/2024-08-01-vision-dataset.

Diffusion

Rather than a mere scientific object, diffusion is treated here as a network of meanings that binds together a technique from physics (diffusion), an algorithm for image generation, a model (Stable Diffusion), an operative metaphor relevant to cultural analysis, and, by extension, a company (Stability AI) and its founder, with roots in hedge fund investment.

In her text "Diffused Seeing", Joanna Zylinska aptly captures the multivalence of the term:

... the incorporation of ‘diffusion’ as both a technical and rhetorical device into many generative models is indicative of a wider tendency to build permeability and instability not only into those models’ technical infrastructures but also into our wider data and image ecologies. Technically, ‘diffusion’ is a computational process that involves iteratively removing ‘noise’ from an image, a series of mathematical procedures that leads to the production of another image. Rhetorically, ‘diffusion’ operates as a performative metaphor – one that frames and projects our understanding of generative models, their operations and their outputs.[1]

In complement to Zylinska's understanding of diffusion as a term operating at different levels with an emphasis on permeability, we inquire into the dialectical relation that opposes it to stability (as interestingly emphasized in the name Stable Diffusion), where the permeability and instability enclosed in the concept constantly motivate strategies of control, direction, capitalization or democratization that leverage the unstable character of diffusion dynamics.

What is the network that sustains this object?

From physics to AI, the diffusion algorithm

Our first move in this network of meanings is to follow the trajectory of the concept of diffusion from the 19th-century laboratory to the computer lab. While diffusion had been studied since antiquity, Adolf Fick published the first laws of diffusion, based on his experimental work, in 1855. As the Wuhan- and Princeton-based AI researchers Pei et al. put it:

In physics, the diffusion phenomenon describes the movement of particles from an area of higher concentration to a lower concentration area till an equilibrium is reached. It represents a stochastic random walk of molecules.[2]

To understand how this idea has been translated into image generation, it is worth looking at the example given by Sohl-Dickstein and colleagues, who authored the seminal paper on diffusion in image generation.[3] The authors propose the following experiment: take an image and gradually apply noise to it until it becomes totally noisy; then train an algorithm to 'learn' all the steps that have been applied to the image and ask it to apply them in reverse to recover the image (see illustration). By introducing some movement into the image, the algorithm detects tendencies in the noise. It then gradually follows and amplifies these tendencies in order to arrive at a point where an image emerges. When the algorithm is able to recreate the original image from the noisy picture, it is said to be able to de-noise. When the algorithm is trained with billions of examples, it becomes able to generate an image from any arbitrary noisy image. And the most remarkable aspect of this process is that the algorithm is able to generalise from its training data: it is able to de-noise images that it never “saw” during training.
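For readers who want the conventional notation, the forward 'noising' process sketched above is usually written as a fixed chain that adds Gaussian noise step by step, with a network trained to undo it; the symbols below follow the standard denoising-diffusion formulation rather than any particular implementation.

```latex
% Forward (noising) process and the quantity the de-noiser learns to predict:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),
\qquad
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,
\quad \varepsilon \sim \mathcal{N}(0,\mathbf{I}),
\quad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s).
% A network \varepsilon_\theta(x_t, t) is trained to predict the added noise \varepsilon;
% generation runs the chain backwards, starting from pure noise.
```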

Another aspect of diffusion in physics that matters for image generation can be seen at the end of the definition of the concept as stated in Wikipedia (emphasis is ours):

diffusion is the movement of a substance from a region of high concentration to a region of low concentration without bulk motion.[4]

Diffusion doesn't capture the movement of a bounded entity (a bulk, a whole block of content); it is a mode of spreading that flexibly accommodates structure. Diffusion is the gradual movement/dispersion of concentration within a body "with no net movement of matter."[5] This characteristic makes it particularly apt at capturing multi-level relations between image parts without having to identify a source that constrains these relations. It gives it access to an implicit structure. Metaphorically, this can be compared to a process of looking for faces in clouds (or reading signs in tea leaves). We do not immediately see a face in a cumulus, but the faint movement of the mass stimulates our curiosity until we gradually delineate the nascent contours of a shape we can identify.

The process of adding noise goes from left to right and the de-noising runs the process backwards to obtain the spiral back from noise.[6]

Stabilising diffusion

Diffusion as presented by Sohl-Dickstein and colleagues is at the basis of many current models for image generation. However, no user deals directly with diffusion as demonstrated in the paper.[7] It is encapsulated in software, and a whole architecture mediates between the algorithm and its environment (see the diagram of the process). For instance, Stable Diffusion is a model that encapsulates the diffusion algorithm and makes it tractable at scale. Rombach et al., the brains behind the Stable Diffusion model, popularized the diffusion technique by porting it into the latent space.[8] Instead of working on pixels, the authors performed the computation on compressed vectors of data and managed to reduce the computational cost of training and inference. They thereby made the technique accessible to a larger community of developers, and also added important features to the process of image synthesis:

By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner.[9]
A diagram of AI image generation (Stable Diffusion), separating 'pixel space' from 'latent space' - what you see, and what cannot be seen and with an overview of the inference process (by Nicolas Malevé)
Because of this, diffusion can be guided by text prompts and other forms of conditioning input, such as images, opening it up to multiple forms of manipulation and use, such as inpainting. This stabilises diffusion in the sense that it allows for different forms of control; the diffusion algorithm in itself doesn't contain any guidance. It is an important step in moving the algorithm out of the worlds of GitHub and tech tutorials into a domain where image makers can experiment with it. The pure algorithm cannot move alone.
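A hedged illustration of this encapsulation: with openly released weights, the whole guided pipeline (text encoder, latent de-noising loop, VAE decoder) can be run locally in a few lines using the community diffusers library. The checkpoint identifier, prompt and settings below are examples rather than recommendations, and a reasonably recent GPU (or a lot of patience on a CPU) is assumed.

```python
# Minimal local text-to-image sketch with an open Stable Diffusion checkpoint,
# via the Hugging Face diffusers library. Model id, prompt and settings are
# illustrative; on CPU, drop the float16 dtype and expect long run times.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",      # any compatible open checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = "4,213 cigarette stubs displayed on needles, white background"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("stubs.png")
```

Under the hood, this single call ties together the components discussed across this catalogue: CLIP-style text encoding, de-noising in the latent space, and a VAE decoding back into pixel space.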

But if diffusion is relatively stabilized in technical terms (input control and infrastructure), its adoption by increasingly large circles of users and developers has contributed to different forms of disruption, for better and for worse: parodies and deepfakes, political satire and revenge porn. Once in circulation, it moves both as a technical product and as images.

Furthermore, rhetorically, it becomes a metaphor within a set of nested metaphors – the brain as computer, concepts such as 'hallucinations' or deep 'dreams' – that respond to a more general cultural condition. As Zylinska notes:

We could perhaps suggest that generative AI produces what could be called an ‘unstable’ or ‘wobbly’ understanding – and a related phenomenon of ‘shaky’ perception. Diffusion [...] can be seen as an imaging template for this model.[10]

Still according to Zylinska, this metaphor posits instability as an organizing concept for the image more generally:

Indeed, it is not just the perception of images but their very constitution that is fundamentally unstable.[11]

As a concept, it is in line with a general condition of instability due to the extensive disruptions brought on by the flows of capital. The wobbly, risky, financial and logistical edifice that supports Stable Diffusion's development testifies to this. The company Stability AI, founded by former hedge fund manager Emad Mostaque, helped finance the transformation of the "technical" product into software available to users and powered by an expensive infrastructure. It also made it possible to sell it as a service. To access large-scale computing facilities, Mostaque raised over $100 million in venture capital.[12] His experience in the financial sector helped convince investors and secure the financial base. The investment was sufficient to give Stability a chance to enter the market. Moving from the computer lab to a working infrastructure required grounding the diffusion algorithm in another material environment comprising Amazon servers, the JUWELS Booster supercomputer, and tailor-made data centres around the world.[13] This scattered infrastructure corresponds to the global distribution of the company's legal structure: one leg in the UK and one in Delaware, the latter offering a welcoming tax environment for companies. Dense networks of investors and servers supplement code. In that perspective, the development of the Stable Diffusion algorithm is inseparable from risk investment. These risks take the concrete form of a long string of controversies and lawsuits, especially for copyright infringement, and the eventual removal of Mostaque from his position as CEO after aggressive press campaigns against his management. Across all its dimensions, the shaky nature of this assemblage mirrors the physical phenomenon that Stable Diffusion's models simulate.

In short, stabilising diffusion means attending to a huge range of problems happening simultaneously that require extremely different skills and competences: algorithmic design, statistical optimization, identifying faulty GPUs, deciding on batch sizes and on the impact of different floating-point formats on training stability, securing investment and managing delays in payment, pushing back against legal actions, and, last but not least, aligning prompts and images.

How does diffusion create value? Or decrease / affect value?

The question of value needs to be addressed at different levels as we have chosen to treat diffusion as a complex of techniques, algorithm, software, metaphors and finance.

First, we can consider diffusion as an object concretised in a material form: the model. The model is at the core of a series of online platforms that monetize access to it. With a subscription fee, users can generate images. Its value stems from the model's ability to generate images in a given style (e.g., Midjourney), with good prompt adherence, reasonably fast. It is a familiar value form for models: AI as a service that generates revenue and capitalizes on the size of a userbase.

As the model is open source, it can also be shared and used in different ways. For instance, users can run the model locally without paying a fee to Stability AI. Alternatively, it can be integrated into peer-to-peer systems of image generation such as Stable Horde, or into shared installations through non-commercial APIs. In this case, the model gains value with adoption. And as interest grows, users start to build things with it, such as LoRAs, bespoke models, and other forms of conditioning. Through this burgeoning activity, the model's affordances grow. Its reputation increases as it enters different economies of attention where users gain visibility by tweaking it or by generating 'great art'.

In scientific circles, the model's value is measured by different metrics. Here, the object of necessity that travels across platforms and individual computers becomes an object of interest. What is at stake is a competition for scientific relevance where diffusion is a solution to a series of ongoing intellectual problems. Yet we should not forget that computer science lives in symbiosis with the field of production and that many scientists are also involved in commercial ventures. For instance, the above-mentioned Robin Rombach gained a scientific reputation that can be evaluated through a citation index, but he was also involved in the company Stability AI. In the constant movement from academic research to production, the ability to experiment emerges as a shared value. This is well captured by Patrick Esser, a lead researcher on diffusion algorithms, who defined the ideal contributor as someone who would “not overanalyze too much” and “just experiment”.[14] The valorization of experimentation even justifies the open source ethos prevalent in the diffusion ecosystem:

“It’s not that we're running out of ideas, we’re mostly running out of time to follow up on them all. By open sourcing our models, there's so many more people available to explore the space of possibilities.” [15]

Finally, if we consider their impact on the currencies of images, diffusion-based algorithms contribute significantly to a decrease in the value of the singular image. If this trend started earlier and has been diagnosed several times (e.g., Steyerl[16]), the capacity of models to churn out endless visual outputs has accelerated it substantially. As MacKenzie and Munster wrote in their seminal piece "Platform Seeing," the value of the image ensemble (i.e., the model) grows at the expense of the singular image: "images both lose their stability and uniqueness yet gather aggregated force".[17] This difference in value is implied in the algorithmic training process. To learn how to generate images, algorithms such as Stable Diffusion, Flux, DALL-E or Imagen need to be fed with examples. These images are given to the algorithm one by one. Through its learning phase, the algorithm treats them as moments of an uninterrupted process of variation, not as singular specimens. At this level, the process of image generation is radically anti-representational. It treats the image as a mere moment: a variation among many. Hence, it is the model that gains singularity.

What is its place/role in techno cultural strategies?

As a concept that traverses multiple dimensions of culture and technology, diffusion begs questions about strategies operating on different planes. In that sense, it constitutes an interesting lens through which to discuss the question of the democratization of generative AI. As a premise, we adopt the view set forth in the paper "Democratization and generative AI image creation: aesthetics, citizenship, and practices"[18] that the relation between genAI and democracy can be reduced neither to one of apocalypse, where artificial intelligence signals the end of democracy, nor to one where we inevitably move towards a better-optimized future in which a more egalitarian world emerges out of technical progress. Both democracy and genAI are unaccomplished projects, and both are risky works in progress. Instead of simply lamenting genAI's "use for propaganda, spread of disinformation, perpetuation of discriminatory stereotypes, and challenges to authorship, authenticity, originality", we should see it as an opportunity to situate "the aestheticization of politics within democracy itself".[19] In short, we think that the relation between democracy and genAI should not be framed as one of impact (where democracy, as a fully achieved project, pre-exists AI's impact on it), but as one where democracy is still to come. And, in the same movement, we should firmly oppose the view that AI is a fully formed entity waiting to be governed, to be democratized. That is, the making of AI should in itself be an experiment in democracy. In this view, both entities inform each other. Diffusion as a transversal concept is a device to identify key elements of this mutual 'enactment'. These pertain to different dimensions of experience, sociality, technology and finance; to different levels of logistics and different scales. The dialectics of diffusion and stability we have tried to characterize is therefore marked by loosely coordinated strategies that include (in no particular order):

  • providing concrete resources such as the model's weights and source code without fee and under a free license (democracy as equal access to resources)
  • producing and disseminating different forms of knowledge about AI: papers, code, tutorials (democratization of knowledge)
  • offering different levels of engagement: as a user of a service, as a dataset curator, as a LoRA creator, as a Stable Horde node manager (democratization as increase of participation)
  • freedom of use in the sense that the platform's censorship is up for debate or can be bypassed locally (democracy as (individual) freedom of expression and deliberation)

And, more polemically, the dialectics of diffusion and stability so far teach us how hard it is to do these things under the constraints of the capitalist mode of production and its financial attachments.


[1] Joanna Zylinska, “Diffused Seeing: The Epistemological Challenge of Generative AI,” Media Theory 8, no. 1 (2024): 230.

[2] Pei, Yuhan, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, and Yu Wu. “SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation.” 2024. https://arxiv.org/abs/2411.19182.

[3] Sohl-Dickstein, Jascha, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.” Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (Lille, France), ICML’15, JMLR.org, 2015, 2256–65.

[4] Wikipedia, s.v. “Diffusion,” last modified August 12, 2025, https://en.wikipedia.org/wiki/Diffusion.

[5] “Diffusion.”

[6] Sohl-Dickstein et al., “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.”

[7] Sohl-Dickstein et al., “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.”

[8] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” arXiv preprint arXiv:2112.10752, last revised April 13, 2022, https://arxiv.org/abs/2112.10752.

[9] Rombach et al., “High-Resolution Image Synthesis with Latent Diffusion Models,” 1.

[10] Zylinska, “Diffused Seeing: The Epistemological Challenge of Generative AI,” 244.

[11] Zylinska, “Diffused Seeing: The Epistemological Challenge of Generative AI,” 247.

[12] Kyle Wiggers, “Stability AI, the Startup behind Stable Diffusion, Raises $101M,” Tech Crunch, October 17, 2022, https://techcrunch.com/2022/10/17/stability-ai-the-startup-behind-stable-diffusion-raises-101m/.

[13] Jülich Supercomputing Centre, JUWELS Booster Overview, accessed August 12, 2025, https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html.

[14] Sophia Jennings, “The Research Origins of Stable Diffusion,” Runway Research, May 10, 2022, https://research.runwayml.com/the-research-origins-of-stable-difussion.

[15] Jennings, “The Research Origins of Stable Diffusion.”

[16] Hito Steyerl, “In Defense of the Poor Image,” in The Wretched Of The Screen (Sternberg Press, 2012).

[17] Adrian MacKenzie and Anna Munster, “Platform Seeing: Image Ensembles and Their Invisualities,” Theory, Culture & Society 36, no. 5 (2019): 3–22, https://doi.org/10.1177/0263276419847508.

[18] Maja Bak Herrie et al., “Democratization and Generative AI Image Creation: Aesthetics, Citizenship, and Practices,” AI & SOCIETY 40, no. 5 (2025): 3495–507, https://doi.org/10.1007/s00146-024-02102-y.

[19] Bak Herrie et al., “Democratization and Generative AI Image Creation: Aesthetics, Citizenship, and Practices,” 3497.

GPU (Graphics Processing Unit)

Image from the post A complete anatomy of a graphics card: Case study of the NVIDIA A100. [1]
AI-generated image (due to copyright restrictions) of the King of Denmark, Jensen Huang (CEO and founder of Nvidia), and Nadia Carlsten (CEO of the Danish Center for AI Innovation) at the inauguration of Denmark's "sovereign" AI supercomputer, aka Gefion, named after the Nordic goddess of ploughing.[2] (Image generated using the Flux model.)

Gefion, Denmark's new AI supercomputer launched in October 2024, is powered by 1,528 H100s, a GPU developed by Nvidia. This object, the Graphics Processing Unit, is a key element that, arguably, paves the way for Denmark's sovereignty and heavy ploughing in the AI world. Beyond all the sparkle, this photo shows the importance of the GPU not only as a technical matter, but also as a political and powerful element of today's landscape.

Since the boom of large language models, Nvidia's graphics cards and GPUs have become somewhat familiar and mainstream. The GPU powerhouse, however, has a long history that predates its central position in generative AI, including the Stable Diffusion ecosystem: casual and professional gaming, cryptocurrency mining, and just the right kind of processing for the n-dimensional matrices that translate pixels and words into latent space and vice versa.

What is a GPU?

A graphics processing unit (GPU) is an electronic circuit focused on processing images in computer graphics. Originally designed for early video games, such as arcade machines, this specialised hardware performs the calculations for generating graphics, first in 2D and later in 3D. While most computational systems have a Central Processing Unit (CPU), the generation of images – for example, 3D polygons – requires a different set of mathematical calculations. GPUs gather instructions for video processing, light, 3D objects, textures, etc. The range of GPUs is vast, from small and cheap processors integrated into phones and smaller devices, to state-of-the-art graphics cards stacked in data centres to compute massive language models.

What is the network that sustains the GPU?

From the earth to the latent space

Like many other circuits, GPUs require a very advanced production process that starts with mineral mining for both common and rare minerals (silicon, gold, hafnium, tantalum, palladium, copper, boron, cobalt, tungsten, etc.). Their life-cycle and supply chain locate GPUs in a material network with the same issues as other chips and circuits: conflict minerals, labour rights, by-products of manufacturing and distribution, and waste. When training or generating AI responses, the GPU is the component that consumes the most energy relative to other computational components. A commercially available home-user GPU like the GeForce RTX 5090 can consume 575 watts, almost double that of a CPU in the same category. Industry GPUs like the A100 draw a similar amount of power, with the caveat that they are usually managed in massive data centres. Allegedly, the training of GPT-4 used 25,000 of the latter for around 100 days (at roughly 400 watts per card, a sustained draw on the order of 10 megawatts – the equivalent of about a million 10-watt LED bulbs – as sketched in the back-of-envelope calculation below). This places GPUs in a highly material network that is for the most part invisible, yet enacted with every prompt, LoRA training, and generation request.
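As a rough, hedged check of such figures (all inputs are assumptions: about 400 watts per A100 under load, 25,000 cards, 100 days, and no account of CPUs, networking or cooling), the arithmetic can be spelled out as follows.

```python
# Back-of-envelope energy estimate for a large training run.
# All inputs are assumptions for illustration, not reported figures.
gpus = 25_000
watts_per_gpu = 400          # approximate A100 board power under load
days = 100

power_mw = gpus * watts_per_gpu / 1e6        # megawatts of sustained draw
energy_gwh = power_mw * 24 * days / 1000     # gigawatt-hours over the whole run

print(f"{power_mw:.0f} MW sustained, {energy_gwh:.0f} GWh total")
# -> roughly 10 MW sustained and about 24 GWh over 100 days,
#    excluding CPUs, networking, and data-centre cooling overheads.
```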

Nvidia GPU life cycle. Work-in-progress project by design undergraduate students at the University of California, Davis - Department of Design. [3]

Perhaps the most important aspect of the GPU is that this object is a – if not the – translation piece: most of the calculations to and from 'human-readable' objects, like a painting, a photograph or a piece of text in pixel space, into an n-number of matrices, vectors, and coordinates, or latent space, are made possible by the GPU. In many ways, the GPU is a transition object, a door into and out of a sometimes grey-boxed space.

The GPU cultural landscape and its recursive publics

It is also a mundane object, historically used by gamers, and thus owes much of its history and design to gaming culture. Indeed, the development of computer games fuelled the design and development of current GPU capabilities. In our own study, we use an Nvidia RTX 3060ti super, bought originally for gaming purposes. When talking about current generative AI, populated by major tech players and trillion-valued companies and corporations, we want to stress the overlapping histories of digital cultures, like gaming and gamers, that shape this object.

Our GPU: Nvidia RTX 3060ti super (photo by Nicolas Maleve).

Being a material and mundane object also opens paths towards autonomy. While our GPU was originally used for gaming, it opened a door for detaching from Big Tech industry players. With some tuning and self-managed software, GPUs can run open models, mid-size and large, including Stable Diffusion. That is, we can generate our own images, or train LoRAs, without depending on ChatGPT, Copilot, or similar offers. Indeed, plenty of enthusiasts follow this path, running, tweaking, branching, and reimagining open source models thanks to GPUs. CivitAI is a great example of a growing universe of models, with different implementations of autonomy: from niche communities of visual culture working on representation, to communities actively developing prohibited, censored, fetishised, and specifically pornographic images. CivitAI hosts alternative models for image generation responding to specific cultural needs, like a specific manga style or anime character, greatly detached from the interests of Silicon Valley's AI blueprints or nations' AI sovereignty imaginaries.

A horde of graphic cards

The means of production of these communities, alongside collaboratively labelled data, is the GPU. AI Horde (also known as Stable Horde), for example, is a distributed cluster of 'workers', i.e. GPUs, that use open models, including many community-generated ones from CivitAI. The volunteer and crowd-sourced project acts as a hub that directs image generation requests from different interfaces (such as ArtBot, Mastodon, or other platforms) towards individual GPU workers in the network. As part of our project, our GPU is (sometimes) connected to this network, offering image generation (using selected models) to any request from the interfaces (websites, APIs, etc.).
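
The hub works as an asynchronous job queue: an interface submits a request, the hub hands it to an available worker, and the requester polls for the result. The following is a rough sketch of that flow, assuming Python's requests library; the endpoint paths, header and field names are assumptions based on the public AI Horde API and should not be read as a verified client.

```python
# Rough sketch of the AI Horde request flow (endpoint paths and fields are assumptions)
import time
import requests

API = "https://aihorde.net/api/v2"           # public hub
HEADERS = {"apikey": "0000000000"}           # anonymous key; registered keys earn kudos

# 1. Submit an asynchronous generation request to the hub
job = requests.post(f"{API}/generate/async", headers=HEADERS, json={
    "prompt": "a kobold guarding a pile of graphics cards, fantasy card art",
    "models": ["stable_diffusion"],          # the hub routes this to a matching worker
}).json()

# 2. Poll until a volunteer GPU somewhere in the network has produced the image
while True:
    status = requests.get(f"{API}/generate/status/{job['id']}").json()
    if status.get("done"):
        break
    time.sleep(5)

print(status["generations"][0]["img"])       # URL or payload of the generated image
```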

Horde AI diagram (by Pablo Velasco).
"Why should I use the Horde instead of a free service?
Because when the service is free, you're the product! Other services running on centralized servers have costs—someone has to pay for electricity and infrastructure. The AI Horde is transparent about how these costs are crowdsourced, and there is no need for us to change our model in the future. Other free services are often vague about how they use your data or explicitly state that your data is the product. Such services may eventually monetize through ads or data brokering. If you're comfortable with that, feel free to use them. Finally, many of these services do not provide free REST APIs. If you need to integrate with them, you must use a browser interface to see the ads."[4]

Projects like this one show that there is an explicit interest in alternatives to the mainstream generative AI landscape, based on collaboration strategies rather than a surveillance/monetisation model, also known as "surveillance capitalism", an extractive model that has become the economic standard of many digital technologies (in many cases with the full support of democratic institutions).[5] In this sense, Horde AI is a project that deviates towards a form of technical collaboration, producing its own model of exchange in the form of a kudos currency, which resembles a barter system where the main material is GPU processing power.

Kobold card from 1991's TSR fantasy art cards [6]

But perhaps more importantly, Horde AI not only shows the necessity and the role of alternative actors and processes in the AI ecosystem, but also the importance of the cultural upbringings of the wilder AI landscape. In a similar fashion to the manga and anime background of the CivitAI population, AI Horde is a project that evolved from groups interested in role-playing. The name 'horde' reflects this imprinting, and the protocol comes from a previous project named "KoboldAI", in reference to the Kobold monster from the role-playing game Advanced Dungeons & Dragons. The material infrastructure of the GPU overlaps with a plethora of cultural layers, all with their own politics of value, collaboration, and ethics, influencing alternative imaginaries of autonomy. And many aspects of the recursive publics in this landscape are technically operationalized through the GPU object.

How did GPUs evolve through time?

The geometry engine

The Geometry Engine of James H. Clark.[11]

Artist and researcher Ben Gansky asks in a performance project: "Do graphic processing units have politics?"[7] He gracefully narrates Ivan Sutherland's 1961 attempt to create an interactive and responsive relation to computers. Together with David Evans, Sutherland created a research group specialising in computer graphics. Evans and Sutherland's students would become founders of important digital image organisations like Pixar, Adobe, and Atari, among others.[8] One of these students, James H. Clark, became a key figure at Xerox PARC, an innovation hub widely known for the creation of the graphical user interface (GUI), the computer mouse and desktop, electronic paper, and a long etcetera. Clark also pioneered the first GPU, called "the geometry engine", and founded Silicon Graphics Inc. (SGI), a company focused on producing 3D graphics workstations.[9]

3D blaster (voodoo) banshee graphic card.[12]

Plenty of companies followed and pushed the development of computer hardware for graphics. The first arcade systems and videogame consoles used ad-hoc hardware and software, eventually producing compatible hardware components known as video cards. The 1990s saw the adoption of real-time 3D graphics cards, such as the popular Voodoo models from 3dfx Interactive, a company eventually acquired by Nvidia in the 2000s.

To the moon

The title of Gansky's performance is a reference to Langdon Winner's classic paper "Do Artifacts Have Politics?"[10] Winner asks about the inherent politics of technological implementations and design in the famous example of bridges built too low to allow access for public transport, thereby acting as a techno-political device that subtly, but deterministically, controls access for particular demographics in New York City.

Bitcoin mining farm. [13]

During the late 2000s, GPUs became an unexpected bridge between computer enthusiasts and alternative economies. The idea of a digital currency detached from banks and other central institutions took shape in the form of a decentralised protocol, Bitcoin, regulated entirely by mathematical calculations. The invention of Bitcoin, the first cryptocurrency based on blockchain technology, altered the supply and demand game for GPUs. Suddenly, this object was not only a gaming artefact, but also a calculation machine to 'mine' cryptocurrency, and for some, to become rich within a few years. This made the GPU not only a device to, technically, create economic value, but also a political means of production outside the regular economic channels. In between libertarian and rebel economic dreams, the GPU became a bridge to a new imaginary of monetary autonomy.

The prediction engine

The design of the GPU was, although not intentionally, also a very good fit for the mathematical problems machine learning researchers were facing, in particular the training of neural networks. Parallel processing made GPUs a perfect candidate for this type of work, and Nvidia released an API (CUDA) in 2007 to allow researchers (as well as game developers) to expand the interaction with and programmability of their GPU cards.
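
The fit is easy to see in practice: the core operation of neural network training is large matrix multiplication, which parallelises naturally across a GPU's thousands of cores. A minimal sketch, assuming PyTorch and an available CUDA device, simply times the same multiplication on CPU and GPU; the matrix sizes are illustrative.

```python
# Minimal sketch: the same matrix multiplication on CPU and GPU (sizes are illustrative)
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.time()
_ = a @ b                                   # CPU execution
cpu_time = time.time() - t0

a_gpu, b_gpu = a.to("cuda"), b.to("cuda")
torch.cuda.synchronize()
t0 = time.time()
_ = a_gpu @ b_gpu                           # thousands of cores work in parallel
torch.cuda.synchronize()                    # wait for the asynchronous GPU kernel to finish
gpu_time = time.time() - t0

print(f"CPU: {cpu_time:.3f}s, GPU: {gpu_time:.3f}s")
```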

NVIDIA H100 Tensor Core GPU.[14]

A canonical convolutional neural network for image classification, AlexNet, was trained using two GPUs in 2012. This network was co-developed by one of the founders and key figures of OpenAI, Ilya Sutskever, and inaugurated the mainstream use of GPUs for machine learning. Some years later, around 10,000 of these objects of interest were used to train GPT-3 on a dataset of hundreds of billions of words. Its immediate successor, GPT-3.5, would be the first model behind the massively popular ChatGPT interface. All major LLMs are trained in massive hubs of GPUs (usually cloud-based), and Nvidia now produces GPUs aimed both at casual and professional gamers and at the industrial AI market. Its A- and H-series, marketed towards cloud and data centres for AI development and training, feed the current demand in the extended AI industry.

Today, the GPU is very much an object of necessity in the AI landscape. Governments, companies, and every institution that attempts to incorporate or participate in this technology require access to GPUs in one form or another. The GPU, however, remains an object brought about by distinct cultural needs, politics, and curiosity.

[1] Adil Lheureux, “A Complete Anatomy of a Graphics Card: Case Study of the NVIDIA A100,” Paperspace Blog, 2022, https://blog.paperspace.com/a-complete-anatomy-of-a-graphics-card-case-study-of-the-nvidia-a100/ .

[2] David Hogan, “Denmark Launches Leading Sovereign AI Supercomputer to Solve Scientific Challenges With Social Impact,” NVIDIA Blog, October 23, 2024, https://blogs.nvidia.com/blog/denmark-sovereign-ai-supercomputer/ .

[3] Design Life-Cycle, accessed August 18, 2025, http://www.designlife-cycle.com/.

[4] AI Horde, “Frequently Asked Questions,” accessed August 18, 2025, https://aihorde.net/faq.

[5] Shoshana Zuboff, “Surveillance Capitalism or Democracy? The Death Match of Institutional Orders and the Politics of Knowledge in Our Information Civilization,” Organization Theory 3, no. 3 (2022): 26317877221129290, https://doi.org/10.1177/26317877221129290.

[6] Kobold TSR card, accessed August 18, 2025. https://www.bonanza.com/listings/1991-TSR-AD-D-Gold-Border-Fantasy-Art-Card-403-Dungeons-Dragons-Kobold-Monster/1756878302?search_term_id=202743485

[7] Ben Gansky (director), Do Graphics Processing Units Have Politics?, video recording, December 15, 2022, https://www.youtube.com/watch?v=pK_mHfpug8I.

[8] Jacob Gaboury, Image Objects: An Archaeology of Computer Graphics (Cambridge, MA: The MIT Press, 2021), https://doi.org/10.7551/mitpress/11077.001.0001.

[9] J. H. Clark, “The Geometry Engine: A VLSI Geometry System for Graphics,” SIGGRAPH Computer Graphics 16, no. 3 (1982): 127–133, https://doi.org/10.1145/965145.801272.

[10] Langdon Winner, “Do Artifacts Have Politics?” Daedalus 109, no. 1 (1980): 121–136.

[11] The Geometry Engine of James H. Clark (Wikimedia commons), accessed August 20, 2025.

[12] 3D blaster (voodoo) banshee graphic card (Wikimedia commons), accessed August 20, 2025. https://en.wikipedia.org/wiki/3dfx

[13] Bitcoin mining farm (Wikimedia commons), accessed August 20, 2025. https://en.wikipedia.org/wiki/Bitcoin

[14] NVIDIA, NVIDIA H100 Tensor Core GPU, accessed August 18, 2025, https://www.nvidia.com/en-us/data-center/h100/.

Hugging face

Hugging Face is a central, cohesive source of support and stability when exploring autonomous AI image creation. It is, simply put, a collaborative hub for AI development – not specifically targeted at AI image creation, but generative AI more broadly (including speech synthesis, text-to-video, image-to-video, image-to-3D, and much more). It attracts amateur developers who use the platform to experiment with AI models, as well as professionals who use the expertise of the company or take the platform as a starting point for entrepreneurship. By making AI models, datasets and also processing power widely available, it can be labelled as an attempt to democratise AI and delink from the key commercial platforms, yet at the same time Hugging Face is deeply intertwined with numerous commercial interests. It is therefore suspended between more autonomous and peer-based communities of practice, and a need for more 'client-server' relations in model training, which generally depends on 'heavy' resources (stacks of GPUs) and specialised expertise.

What is the network that sustains Hugging Face?

Hugging Face is a platform, but what it offers more closely resembles an infrastructure for, in particular, training models. As such, Hugging Face is an object that operates in a space that is not typically seen by users of conventional generative AI. It is a pixel space for developers (amateurs or professionals) to use and interact with the computational models of a latent space (see Maps), and to specify advanced settings for model training (see LoRA), but also to access a material infrastructure of GPUs.

Companies involved in training foundation models have their own infrastructures (specialised racks of hardware and expertise), but they may make their models available on Hugging Face. This includes Stability AI, but also the Chinese DeepSeek, and others. Users often upload their own datasets to experiment with the many models on Hugging Face, and typically, these datasets are freely available on the platform for other users. But users also experiment in other ways. They 'post-train' the models and create LoRAs, for instance. Others create 'pipelines' of models, meaning that the outcome of one model can become the input for another model. At the time of writing there are nearly 500,000 datasets and 2,000,000 models freely available. Amidst this vertiginous expansion, there is a growing need to document the products exchanged on the platform and standardize the information. The uneven adoption of model cards describing the contents of a model emerges as a response to the needs of a lively community where users share their creations and experiences. It is safe to say that this community has fostered specialised developer knowledge of how to experiment with computational models.
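
A minimal sketch of such a 'pipeline' of models, assuming the diffusers library and an illustrative checkpoint hosted on Hugging Face: a text-to-image model generates a draft, and an image-to-image model takes that draft as its input and reworks it. The model identifiers, prompts and strength value are assumptions for illustration only.

```python
# Minimal sketch: the output of one model becomes the input of another ('pipeline' of models)
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: text-to-image (illustrative checkpoint pulled from Hugging Face)
txt2img = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
draft = txt2img("a lighthouse on a cliff at dusk").images[0]

# Stage 2: image-to-image, taking stage 1's output as input and reworking its style
img2img = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
final = img2img(prompt="in the style of a woodcut print", image=draft, strength=0.6).images[0]
final.save("pipeline_output.png")
```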

Hugging Face's community section reveals how users predominantly have high developer expertise and knowledge of how AI models work, and of how to employ or experiment with them.[1]

How has Hugging Face evolved through time?

Hugging Face was initially set up in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf. Already in 2017 they received their first round of investment of $1.2 million[2], and as stated in the press coverage at the time, Hugging Face was a "new chatbot app for bored teenagers. The New York-based startup is creating a fun and emotional bot. Hugging Face generates a digital friend so you can text back and forth and trade selfies."[3] In 2021 they received a $40 million investment to develop their "open source library for natural language processing (NLP) technologies." There were (in 2021) 10,000 forks (i.e., branches of development projects) and "around 5,000 companies [...] using Hugging Face in one way or another, including Microsoft with its search engine Bing."[4]

This trajectory shows how the company has gradually moved from providing a service (a chatbot) to becoming a major (if not the) platform for AI development – now, not only in language technologies, but also (as mentioned) in speech synthesis, text-to-video, image-to-video, image-to-3D and much more. But it also shows an evolution of generative AI. Early ChatGPT (developed by OpenAI and released in 2022), like today's version, used large language models (LLMs), but offered very few parameters for experimentation: the prompt (textual input) and the temperature (the randomness and creativity of the model's output). Today, there are all kinds of components and parameters. This also explains the present-day richness of Hugging Face's interface: many of the commercial platforms do not offer this richness, and an intrinsic part of the delinking from them seems to be attached to a fascination with settings and advanced configurations (see also Interfaces).

Screenshot of huggingface.co, 2017. At the time, the company was entirely focused on building a new chatbot app for teenagers.
Screenshot of huggingface.co, 2025. Today the company offers access to millions of models and datasets for its users to experiment and develop with.

How does Hugging Face affect the creation of value?

Hugging Face has an estimated market value of $4.5 billion (as of 2023).[5] What does the exorbitant value of a platform unknown to the general public reflect?

On the one hand, the company has capitalised on the various communities of developers in, for instance, image and vision who experiment on the platform and share their datasets and LoRAs, but this is only a partial explanation.

Hugging Face is not only for amateur developers. On the platform one also finds an 'Enterprise Hub', where Hugging Face offers, for instance, advanced computing at a higher scale with a more dedicated hardware setup ('ZeroGPU', see also GPU), as well as 'Priority Support'. For this more commercial use, access is typically more restricted. In this sense, the platform has become innately linked to a plane of business innovation and has also teamed up with Meta to boost European startups in an "AI Accelerator Program".[6]

Notably, Hugging Face also collaborates with other key corporations in the business landscape of AI. For instance, it works with Amazon Web Services (AWS), allowing users to make models trained on Hugging Face available through Amazon SageMaker.[7] Nasdaq Private Market also lists a whole range of investors in Hugging Face (Amazon, Google, Intel, IBM, NVIDIA, etc.).[8]

The excessive (and growing) market value of Hugging Face reflects, in essence, the high degree of expertise that has accumulated within a company that has consistently sought to accommodate a cultural community, but also a business and enterprise plane of AI. Managing an infrastructure of both hardware and software for AI models at this scale is a highly sought-after expertise.
A diagram by The European Business Review representing Hugging Face's business model.[9]

What is the role of Hugging Face in techno-cultural strategies?

Notwithstanding the Enterprise Hub, Hugging Face also remains a hub for amateur developers who experiment with generative AI beyond what the commercial platforms conventionally offer, and who share their insights in the platform's 'Community' section. An example is the user 'mgane', who has shared a dataset of "76 cartoon art-style video game character spritesheets." The images come from "open-source 2D video game asset sites from various artists." mgane has used them on Hugging Face to build LoRAs on Stable Diffusion, that is, "for some experimental tests on Stable Diffusion XL via LORA and Dreambooth training methods for some solid results post-training."[10]
One of "76 cartoon art-style video game character spritesheets."

A user like mgane is arguably both embedded in a specific 2D gaming culture and in possession of the developer skills necessary to access and experiment with models in the command line interface. However, users can also access the many models on Hugging Face through more graphical user interfaces like Draw Things, which allows for accessing and combining models and LoRAs to generate images, and also for training one's own LoRAs (see Interfaces).

How does Hugging Face relate to autonomous infrastructures?

Looking at Hugging Face, the separation of community labour from capital interests (i.e., 'autonomy') in generative AI does not seem to be an either-or. Rather, the dependencies of autonomous generative AI seem to be in constant movement, gravitating from 'peer-to-peer' communities towards 'client-server' relations that are more easily capitalised. This may be due to the high level of expertise and the technical requirements of the infrastructures involved in generative AI, but it is not without consequence.

When, as noted by the European Business Review, most tech companies in AI want to collaborate with Hugging Face, it is because the company offers an infrastructure for AI.[11] Or, rather, it offers a platform that performs as an infrastructure for AI – a "linchpin" that keeps everything in production in position. As also noted by Paul Edwards, a platform seems to be, in a more general view, the new mode of handling infrastructures in the age of data and computation.[12] Working with AI models is a demanding task that requires expertise, hardware and an organisation of labour, and what Hugging Face offers is speed, reliability, and not least agility in a world of AI that is in constant flux, and where new models and techniques are introduced on an almost monthly basis.

With its 'linchpin status', Hugging Face builds on already existing infrastructures, such as the flows of energy and water necessary to make the platform run. It also relies on social and organisational infrastructures, such as those of both start-ups and cultural communities. At the same time, however, it also reconfigures these relations – creating cultural, social and commercial dependencies on Hugging Face as a new 'platformed' infrastructure for AI.

[1] Aryan V S (@a-r-r-o-w), “Caching Is an Essential Technique Used in Diffusion Inference Serving,” Hugging Face, last modified August 2025, https://huggingface.co/posts/a-r-r-o-w/278025275110164.

[2] “Hugging Face,” Wellfound, accessed August 11, 2025, https://wellfound.com/company/hugging-face/funding.

[3] Romain Dillet, “Hugging Face Wants to Become Your Artificial BFF,” TechCrunch, March 9, 2017, https://techcrunch.com/2017/03/09/hugging-face-wants-to-become-your-artificial-bff/.

[4] Romain Dillet, “Hugging Face Raises $40 Million for Its Natural Language Processing Library,” TechCrunch, March 11, 2021, https://techcrunch.com/2021/03/11/hugging-face-raises-40-million-for-its-natural-language-processing-library/.

[5] “Hugging Face,” Sacra, accessed August 11, 2025, https://sacra.com/c/hugging-face/.

[6] “META Collaboration Launches AI Accelerator for European Startups,” Yahoo Finance, March 11, 2025, https://finance.yahoo.com/news/meta-collaboration-launches-ai-accelerator-151500146.html.

[7] “Amazon,” Hugging Face, accessed August 11, 2025, https://huggingface.co/amazon.

[8] “Hugging Face,” Nasdaq Private Market, accessed August 11, 2025, https://www.nasdaqprivatemarket.com/company/hugging-face/.

[9] “Hugging Face: Why Do Most Tech Companies in AI Collaborate with Hugging Face?” The European Business Review, accessed August 11, 2025, https://www.europeanbusinessreview.com/hugging-face-why-do-most-tech-companies-in-ai-collaborate-with-hugging-face/.

[10] mgane, “2D_Video_Game_Cartoon_Character_Sprite-Sheets,” Hugging Face, accessed August 11, 2025, https://huggingface.co/datasets/mgane/2D_Video_Game_Cartoon_Character_Sprite-Sheets.

[11] “Hugging Face: Why Do Most Tech Companies in AI Collaborate with Hugging Face.”

[12] Paul N. Edwards, “Platforms Are Infrastructures on Fire,” in Your Computer Is on Fire, ed. Thomas S. Mullaney, Benjamin Peters, Mar Hicks, and Kavita Philip (Cambridge, MA: MIT Press, 2021), 197–222. https://doi.org/10.7551/mitpress/10993.003.0021


Interfaces to autonomous AI

Interfaces to generative AI come in many forms. There are graphical user interfaces to the models of generative AI; interfaces between the different types of software, such as an API (Application Programming Interface) through which one can integrate a model into other software; and on a material plane, there are also interfaces to the racks of servers that run the models, or between them.

What is of particular interest here – when navigating the objects of interest and necessity – is, however, the user interface to autonomous AI image generation: the ways in which a user (or developer) accesses the 'latent space' of computational models (see Maps). A computational model is not visible as such. Therefore, the user's first encounter with AI is typically through an interface that renders the flow of data tangible in one form or another. How does one access and experiment with Stable Diffusion and autonomous AI?

What is the network that sustains the interface?

Most people who have experience with AI image creation will have used flagship generators such as Microsoft's Bing Image Creator, OpenAI's DALL-E or Adobe Firefly. Here, the image generator interface is often integrated into other corporate services. Bing, for instance, is not merely a search engine, but also integrates all the other services offered by Microsoft, including Microsoft Image Creator. The Image Creator is, as expressed in the interface itself, capable of making users "surprised" or "inspired", or of letting them "explore ideas" (i.e., be creative). There is, in other words, an expected affective smoothness in the interface – a simplicity and low entry threshold for the user that perhaps also explains the widespread preference for these commercial platforms. What is noticeable in this affective smoothness (besides the integration into the platform universes of Microsoft, Adobe or OpenAI) is that users are offered very few parameters in the configuration; basically, the interaction with the model is reduced to a prompt. Interfaces to autonomous AI differ significantly from this in several ways.

A screenshot of Microsoft Image Creator's interface, encouraging the user to 'be surprised' and 'explore ideas'. Elsewhere it also promises that the user will 'get inspired'.

First of all, not all of them offer a web-based interface. The interfaces for generating images with Stable Diffusion therefore also vary, and there are many options depending on the configuration of one's own computer. ComfyUI, for instance, is commonly used with models you can run locally and employs a node-based workflow, making it particularly suitable for visually reproducing 'pipelines' of models (see also Hugging Face). It works for Windows, macOS and Linux users. Draw Things is suitable for macOS users. ArtBot is another example that has a web interface as well as integration with Stable Horde, allowing users to generate images in a peer-based infrastructure of GPUs (as an alternative to the commercial platforms' cloud infrastructure).

A screenshot of ComfyUI's interface, displaying a node-based workflow where tasks can be ordered in pipelines.
A screenshot of Draw Things' user interface. It displays the prompt, the text field for negative prompts, and also how LoRAs and training models can be stacked when generating images.
A screenshot of ArtBot's interface, with some of the many settings used for generating outputs.
A screenshot of ArtBot's interface to Stable Horde.

Secondly, in autonomous AI image creation one finds a great variety of settings and configurations. To generate an image, there is naturally a prompt, but also the option of adding a negative prompt (instructions on what not to include in the generated image). One can also combine models, e.g., use a 'base model' of Stable Diffusion and add LoRAs (one's own or imported from CivitAI or Hugging Face). There is also the option of determining how much weight the added models should have in the generation of the image, the size of the image, or a 'seed' that allows for variations (of style, for instance) while maintaining some consistency in the image, and plenty more parameters to experiment with.
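
A minimal sketch of what these settings look like in code, assuming the diffusers library, a Stable Diffusion checkpoint and an illustrative LoRA file; the parameter used to weight the LoRA can differ between library versions, so treat the names here as assumptions rather than a fixed recipe.

```python
# Minimal sketch: prompt, negative prompt, seed, image size and a LoRA weight in one call
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
pipe.load_lora_weights("my_filmnoir_lora.safetensors")    # hypothetical LoRA file

image = pipe(
    prompt="portrait of a detective in the rain, film noir",
    negative_prompt="colour, blur, extra fingers",         # what not to include
    width=512, height=768,                                 # image size
    guidance_scale=7.5,                                    # how strictly to follow the prompt
    generator=torch.Generator("cuda").manual_seed(42),     # fixed seed for reproducible variations
    cross_attention_kwargs={"scale": 0.8},                 # LoRA weight (name may vary by version)
).images[0]
image.save("noir_portrait.png")
```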

Thirdly, like the commercial platforms, interfaces to autonomous AI also offer integration into other services, but with much less restriction. Stability AI, for instance, offers an application programming interface (API), a programmatic interface that allows developers to integrate image generation capabilities into their own applications. Likewise, Hugging Face (a key hub for AI developers and innovation) provides an array of models that are released under different licences (some more open, some more restricted for, say, commercial use) and which can be integrated into new tools and services.

Fourthly, many of the interfaces are not just for generating images using the available models and settings. The visual cultural preferences for, say, a particular style of manga also lead to further complexity in the user interface. That is, interfaces like Draw Things and ComfyUI are simultaneously also interfaces for training one's own models (i.e., for building LoRAs), and possibly making them available on CivitAI or a similar platform, so that others who share the same affinity for this particular visual style can use them.

A screenshot of the interface to create a LoRA in Draw Things. It displays a selection of some of the many parameters that users can define for the training, as well as the upload of images for the training dataset. The images also need to be annotated individually.

In short, interfaces to autonomous AI are open-ended in multiple ways, and typically sit not only between use ('pixel space') and model ('latent space'), but simultaneously also between the models and a developer space that ordinary users typically do not see. This doubleness allows users to specify, in detail, the visual outputs, including combining models or even building their own. It also allows for specific requirements on a material plane, such as the use of one's own GPU or a collectively shared GPU in Stable Horde's distributed network.

How do interfaces to autonomous AI create value?

Using a commercial platform, one quickly experiences a need for 'currencies'. In Microsoft Image Creator, for instance, there are 'credits' that allow users a place at the front of the GPU queue, speeding up an otherwise slow process of generating an image. These credits are called Microsoft Reward points, and are basically earned either by waiting (as a punishment) or by being a good customer who regularly uses Microsoft's other products. One earns points, for instance, for searching through Bing, using the Windows search box, buying Microsoft products, and playing games on Microsoft Xbox. Use is, in other words, intrinsically related to a plane of business and value creation that capitalises on the use of energy and GPUs on a material plane (see Maps).

Like the commercial platform interfaces, the interfaces for Stable Diffusion also rely on a 'business plane' that organises access to a material infrastructure, but they do so in very different ways. For instance, Draw Things allows users to generate images on their own GPU (on a material plane), without the need for currencies. And with ArtBot it is possible to connect to Stable Horde, accessing the processing power of a distributed network. Here, users are also allowed a place at the front of the queue, but this is not granted on the basis of their loyalty as 'customers', but on their loyalty to the peer network. Allowing other users to access one's GPU is rewarded with 'Kudos', which can then be used to skip the waiting line when accessing other GPUs. A free ride is in this sense only available if the network makes it possible.

What is their role in techno-cultural strategies?

The commercial platform interfaces for AI image creation are sometimes criticised for their biases or copyright infringements, but many people see them as useful tools.[1] They can be used, for instance, by creatives to test out ideas and get inspired. Frequently, they are also used in teaching and communication. This could be for illustration, as an alternative to, say, an image found on the internet, whose use might otherwise violate copyright. They are increasingly also used to display complex ideas in an illustrative way. Often, the model will fail or reveal its cultural biases in this attempt, and at times (perhaps even as a new cultural genre), presentations also include the failed attempts, to ridicule the AI model and laugh at how it perceives the complexity of illustrating an idea (see the discussion of reflexive prompting in the prompt entry).

With their many settings, interfaces to autonomous AI accommodate a much more fine-grained visual culture. As previously mentioned, this can be found on sites such as CivitAI or Danbooru. Here one finds a visual culture that is not only invested in, say, manga, but often also in LoRAs. That is, on CivitAI there are images created with the use of LoRAs to generate specific stylistic outputs, but also requests to use specific LoRAs to generate images.

Screenshot of the interface of the platform CivitAI, displaying a user-generated image of an elver. It is made using the model FLUX and two LoRAs that make the image appear in the style of a Rembrandt painting. It is an example of one of the many very specific visual experiments on the platform.

The complex use of interfaces testifies to how highly skilled the practitioners within the interface culture of autonomous AI image creation are: when generating images, one has to understand how to make visual variations using 'seed' values, or how to make use of Stable Horde's Kudos (currencies) to speed up the process; when building and annotating datasets for LoRAs and creating 'trigger words', one has to understand how this ultimately relates to how one prompts when generating images with the LoRA; when setting 'learning rates' (in training LoRAs), one has to understand the implications for the use of processing power and energy; and so on. In other words, operating the interface demands not only a deep knowledge of visual culture, but also deep insights into how models work and into the complex interdependencies of different planes of use, computation, social organisation, value creation, and material infrastructure.

How do interfaces relate to autonomous infrastructures?

To conclude, interfaces to autonomous AI image generation seem to rely on a need for parameters and configurations that accommodate particular and highly specialised visual expressions, but they also give rise to a highly specialised interface culture that possesses deep insights into not only visual culture, but also the technology. Such skills are rarely afforded by the smooth commercial platforms that overflow visual culture with an abundance of AI generated images. Interfaces to autonomous AI sometimes also build in a decentralisation of processing power (GPU), either by letting users process the images on their own computers, or by accessing a peer network of GPUs. Despite this decentralisation, interfaces to autonomous AI are not detached from commercial interests and centralised infrastructures. The integration of and dependency on a platform like Hugging Face is a good example of this.

[1] Analytics Insight, “What Is Bing AI Image Creator,” Medium, August 10, 2023, https://medium.com/@analyticsinsight/what-is-bing-ai-image-creator-ba8ac1e8eb1e.

LAION

If our tour has led us into well-funded companies such as Hugging Face or CivitAI and their attachments in the heart of venture capital, it also leads us, at the opposite end of the financial spectrum, to significant actors that operate within a minimal economy, such as Stable Horde. The Large-scale Artificial Intelligence Open Network (LAION) fits in this category. It is a non-profit organization whose ambition is to democratize AI by encouraging "open public education and a more environment-friendly use of resources by reusing existing datasets and models."[1] LAION operates on small donations, in the form of money but mostly in the form of cloud compute.[2]

LAION's logo [3]

LAION's co-founder, Christoph Schuhmann, is the driving force behind one major object of necessity for the generative AI ecosystem: a series of datasets that outscaled the existing offer. The curatorial method for these datasets was entirely automated. It cleverly leveraged available resources such as Common Crawl and Google Colab to download text-image pairs en masse from the internet. This curatorial method differs radically from the practice of affective involvement discussed in the LoRA entry, where anime enthusiasts select images by hand from a visual domain they cherish. It also contrasts with the method used in earlier large-scale datasets such as ImageNet, where the annotation work was performed manually and crowdsourced. In the case of LAION-5B, which contains 5.85 billion images, Schuhmann and his collaborators used an index of webpages compiled by the non-profit Common Crawl to find HTML documents with <img> tags and extract their alt text (alt text is a descriptive text that acts as a substitute for visual items on a page, and is sometimes included in the image data to increase accessibility). The work of annotation is delegated to the then just-released CLIP model, tasked with verifying the relation between the downloaded images and the adjacent alt text.[4] The comparison is even more striking with a subsequent dataset, LAION-Aesthetics, consisting of a subset of the 5-billion-image dataset that contains images of higher aesthetic quality.[5] This object of high interest for the newly burgeoning field of image generation, which desperately looked for stylistically rich images to train algorithms, was assembled using an approach that again favoured integral automation. This time the selection was handled by a custom-made model trained on CLIP embeddings to evaluate the quality of images by attaching an aesthetic score to them.
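
The following is a minimal sketch of that automated curation logic, assuming the BeautifulSoup, Pillow, requests and transformers libraries and the public "openai/clip-vit-base-patch32" checkpoint: it pulls <img> tags and their alt text out of an HTML page, then uses CLIP to score how well image and caption match, keeping only pairs above a threshold. The page URL, helper names and threshold are illustrative assumptions, not LAION's actual code.

```python
# Minimal sketch of alt-text harvesting and CLIP filtering (names and threshold are illustrative)
import requests, torch
from io import BytesIO
from PIL import Image
from bs4 import BeautifulSoup
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def harvest_pairs(url):
    """Extract (image URL, alt text) pairs from one web page, as crawl-based scraping would."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [(img["src"], img["alt"]) for img in soup.find_all("img") if img.get("alt")]

def clip_score(image, caption):
    """Cosine similarity between CLIP's image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    return torch.cosine_similarity(img_emb, txt_emb).item()

kept = []
for src, alt in harvest_pairs("https://example.org/some-page"):       # hypothetical page
    image = Image.open(BytesIO(requests.get(src, timeout=10).content)).convert("RGB")
    if clip_score(image, alt) > 0.28:                                  # illustrative threshold
        kept.append((src, alt))
```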

This can be explained by the fact that LAION operates with a minimal budget and could not afford the cost of manual verification and annotation of a dataset of that scale. But in the case of LAION, the automation of curation did not preclude artisanal practice. It displaced it. An interview given by Schuhmann shows the ad-hoc and low-tech nature of the bricolage that presided over the creation of an object that helped spark the development of image generation [6]:

“Then in the spring of 2021, I sat down and just wrote down a huge spaghetti code in a Google Colab and then asked around on Discord who wanted to help me with it. Someone got in touch, who later turned out to be only 15 at the time. And he wrote a tracker, basically a server that manages lots of colabs, each of which gets a small job, extracts a gigabyte, and then uploads the results. At that time, the first version was still using Google Drive.”[7]
“We then did a [blog post about our dataset](https://laion.ai/blog/laion-400-open-dataset/), and after less than an hour, I already had an email from the Hugging Face people wanting to support us. I had then posted on the Discord server that if we had $5,000, we could probably create a billion image-text pairs. Shortly after, someone already agreed to pay that: “If it’s so little, I’ll pay it.” At some point, it turned out that the person had his own startup in text-to-image generation, and later he became the chief engineer of Midjourney.”[8]

These two fragments are worth quoting at length. In them, Schuhmann traces a line that goes from the management of the limits of user accounts on Colab and Google Drive, through the informality of meeting a coder on Discord (who turns out to be a teenager), to the future chief engineer of a major company of the field. These anecdotes indicate how the dataset functions as an attractor for actors and projects of radically different scales and funding.

In the dataset entry, we characterized datasets as conduits of visual culture entering a model. Examining the controversies surrounding LAION, we have to underline how these conduits problematically enable the reigning extractivism of the AI industry. Indeed, the curatorial method devised by LAION does not include seeking permission from the images' rights owners, and several court cases are currently being brought by artists and image agencies against Stability AI and others on the grounds that their use of the images contained in the LAION dataset is infringing.[9] Here the non-profit status of the organization plays an ambiguous role. For Schuhmann, his association benefits from an exception granted by the EU data mining directive for scientific research.[10] If this is true for LAION itself, the same can't be said for the parties interested in the object. If the dataset is an object of necessity for Stability AI and Midjourney as much as for Stable Horde or the individual users generating images with their models, the images it contains are also objects of necessity for the artists who produced them. What the example of LAION reveals is that even if these collections of images are sites of convergence for actors and projects of different scales and means, they are at the same time sites of divergence for their authors, who have radically different interests.


[1] “LAION,” LAION, n.d., accessed August 22, 2025, https://laion.ai/.

[2] Christoph Schuhmann, “AI as a Superpower: LAION and the Role of Open Source in Artificial Intelligence,” MLCon, June 21, 2023, https://mlconference.ai/blog/ai-as-a-superpower-laion-and-the-role-of-open-source-in-artificial-intelligence/.

[3] “LAION,” LAION, n.d., accessed August 22, 2025, https://laion.ai/.

[4] Romain Beaumont, “LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets,” LAION, March 31, 2022, https://laion.ai/blog/laion-5b/.

[5] Christoph Schuhmann, “LAION-Aesthetics,” LAION, August 16, 2022, https://laion.ai/blog/laion-aesthetics/.

[6] Schuhmann, “AI as a Superpower: LAION and the Role of Open Source in Artificial Intelligence.”

[7] Ibid.

[8] Ibid.

[9] Andersen et al. v. Stability AI Ltd. et al., 3:2023cv00201 (US District Court for the Northern District of California 2023), https://dockets.justia.com/docket/california/candce/3:2023cv00201/407208.

[10] Glyn Moody, “German Court: LAION’s Generative AI Training Dataset Is Legal Thanks to EU Copyright Exceptions,” Techdirt, October 25, 2024, https://www.techdirt.com/2024/10/25/german-court-laions-generative-ai-training-dataset-is-legal-thanks-to-eu-copyright-exceptions/.

Latent space

In contrast to pixel space, where users engage with AI images perceptually, the latent space is an abstract space internal to a generative algorithm such as Stable Diffusion. It can be represented as a transitional space between the collection of images in a dataset and the generation of new images. In the latent space, the dataset is translated into statistical representations that can be reconstructed back into images. As explained by Joanna Zylinska:

In Stable Diffusion, it was the encoding and decoding of images in so-called ‘latent space’, i.e., a simplified mathematical space where images can be reduced in size (or rather represented through smaller amounts of data) to facilitate multiple operations at speed, that drove the model’s success.[1]

It is useful to think about that space as a map. As Estelle Blaschke, Max Bonhomme, Christian Joschke and Antonio Somaini explain:

A latent space consists of vectors (series of numbers, arranged in a precise order) that represent data points in a multidimensional space with hundreds or even thousands of dimensions. Each vector, with n number of dimensions, represents a specific data point, with n number of coordinates. These coordinates capture some of the characteristics of the digital object encoded and represented in the latent space, determining its position relative to other digital objects: for example, the position of a word in relation to other words in a given language, or the relationship of an image to other images or to texts.[2]
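
As a minimal sketch of what 'position relative to other digital objects' means in practice, here is a toy example with made-up three-dimensional vectors (real latent spaces have hundreds or thousands of dimensions); the vectors and labels are purely illustrative assumptions.

```python
# Toy sketch: data points as vectors, and 'closeness' as an angle between them
import numpy as np

# Made-up 3-dimensional embeddings (real latent spaces use hundreds or thousands of dimensions)
embeddings = {
    "photo of a forest":    np.array([0.9, 0.1, 0.2]),
    "painting of a forest": np.array([0.8, 0.2, 0.3]),
    "photo of a lamp post": np.array([0.1, 0.9, 0.4]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the same direction, values near 0 mean unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = embeddings["photo of a forest"]
for label, vec in embeddings.items():
    print(f"{label:>22}: {cosine(query, vec):.2f}")  # the two forests end up closest to each other
```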

The relation between datasets and latent space is a complex one. A latent space is a translation of a given training set. Therefore, in the process of model training, datasets are central, and various factors, such as the curation method or scale, have a great impact on the regularities that can be learned and represented as vectors. But a latent space is not a dataset's literal copy. It is a statistical interpretation. A latent space gives a model its own identity. As WetCircuit (AKA Cutscene Artist), a prominent user and author of tutorials for the Draw Things app, puts it, a model is not "bottomless."[3] This is due to the fact that the model's latent space is finite and therefore biased. This is an important argument for the defense of a decentralized AI ecosystem that ensures a diverse range of worldviews.

The multiplication of models (and therefore their latent spaces) produces various kinds of relations between them. In current generation pipelines, images are rarely the pure products of one latent space. In fact, there are different components intervening in image generation. Small models such as LoRAs add capabilities to the model. The software CLIP is used to encode user input. Other components, such as upscalers, add higher resolution to the results. There are many latent spaces involved, each with their own inflections, tendencies and limitations. Another important form of relation occurs when one model is trained on top of an existing one. For instance, as Stable Diffusion is open source, many coders have used it as a basis for further development. On CivitAI, a model page has a field called Base Model which indicates the model's "origin", an important piece of information for those who will use it, as they are entitled to expect a certain similarity between the model's behaviour and its base. As models are retrained, their inner latent space is modified, amplified or condensed. Their internal map is partially redrawn. But the resulting model retains many of the base model's features. The tension between what a new model adds to the latent space and what it inherits is explored further in the LoRA entry.

The abstract nature of the latent space makes it difficult to grasp. The introduction of techniques of prompting, text-to-image, made the exploration of latent space using natural language possible. And the ability to use images as input to generate other images, image-to-image, has opened a whole field of possibilities for queries that may be difficult to formulate in words. While in pixel space images and texts belong to different perceptual registers and relate to different modes of experience of the world, things change in latent space. Once encoded as latent representations, they are both treated as vectors and participate smoothly in the same space. This multi-modal quality of existing models is in part possible because other components, such as the variational autoencoder and CLIP, can transform various media such as texts and images into vectors. And it is the result of a decade of pre-existing work on classification and image understanding in computer vision, where algorithms learned how a tree is different from a lamp post, or a photo from an 18th-century naturalist painting.

For a discussion of the implications of latent space, see also the entries diffusion, LoRA and maps.
A diagram of AI image generation (Stable Diffusion), separating 'pixel space' (what you see) from 'latent space' (what cannot be seen), with an overview of the inference process (by Nicolas Maleve).


[1] Joanna Zylinska, “Diffused Seeing: The Epistemological Challenge of Generative AI,” Media Theory 8, no. 1 (2024): 229–258, https://doi.org/10.70064/mt.v8i1.1075.

[2] Estelle Blaschke, Max Bonhomme, Christian Joschke, and Antonio Somaini, “Introduction. Photographs and Algorithms,” Transbordeur 9 (January 2025), https://doi.org/10.4000/13dwo.

[3] wetcircuit. "as a wise man once said, there's not really ONE true model. They each have their 'look' and their quirks, they are not bottomless wells, so often we switch models to get fresh inspiration, or because one is very good at clean illustration while another is very cinematic..." Discord, general-chat, Draw Things Official, August 8, 2025.

LoRA

LoRA "The Incredible Hulk (2008)."[1]

On his personal page on the CivitAI website, the user BigHeadTF promotes his recent creation, a small model called "The Incredible Hulk (2008)." Compared to earlier movies featuring the Hulk, the 2008 version shows a tormented Bruce Banner who transforms into a green creature with "detailed musculature, dark green skin, and an almost tragic sense of isolation".[2] The model helps generate characters resembling this iconic version of the Hulk in new images.

To demonstrate the capabilities of his model, BigHeadTF has selected a few pictures he created with the LoRA. Hulk is in turn depicted cajoling a teddy bear or crossdressing as Shrek's Princess Fiona. The images play with the contrast between Hulk's overblown virility and childlike or female connotations. They demonstrate the model's ability to expand the hero's universe into other registers or fictional worlds. "The Incredible Hulk (2008)" doesn't just faithfully reproduce existing images of Hulk, it also opens new avenues of creation and combination for the green hero.

This blend of pop and remix culture, which thrives on the blurring of boundaries between genres, infuses a large number of creations made with generative AI. However, BigHeadTF shares more than images; he also offers the software component that makes his images distinctive. The model he distributes on his page is called a LoRA. The most famous models, such as Stable Diffusion or Flux, are rather general-purpose. These 'base' or 'foundation' models can be used to generate images in many styles and can handle a huge variety of prompts. But they may show limitations when a user wants a specific output, such as a particular genre of manga or a style that emulates black-and-white film noir, or when an improvement is needed for certain details (specific hand positions, etc.) or to produce legible text. This is where LoRAs come in. A LoRA is a smaller model created with a technique that makes it possible to improve the performance of a base model on a given task.

A technical primer

Initially developed for LLMs, the Low-Rank Adaptation (LoRA) technique is a fine-tuning method that freezes an existing model and inserts a smaller number of weights to adjust the model's behaviour to a particular need. Instead of a full retraining of the model, LoRAs only require the training of the weights that have been inserted in the model's attention layers.[3] LoRAs are therefore quite lightweight and able to leverage the capabilities of larger models. Users equipped with a consumer-grade GPU can train their own LoRAs reasonably fast (on a Mac M3, a LoRA can be produced in 30 minutes). LoRAs are quite popular within communities of amateurs and developers alike. At the time of writing, the AI platform Hugging Face lists 71,312 LoRAs.
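
In a rough sketch, the idea is that a frozen weight matrix W is adjusted by adding a low-rank update composed of two much smaller trainable matrices, so that the effective weight becomes W + (alpha/r)·B·A. The numbers below are illustrative, chosen only to show how few extra parameters this involves.

```python
# Toy sketch of the low-rank update behind LoRA (dimensions are illustrative)
import numpy as np

d, r = 4096, 8                      # layer width and LoRA rank
alpha = 16                          # scaling factor commonly exposed as a training parameter

W = np.random.randn(d, d)           # frozen base weights: ~16.8 million parameters
A = np.random.randn(r, d) * 0.01    # trainable low-rank factors:
B = np.zeros((d, r))                #   only 2 * d * r = 65,536 parameters

W_effective = W + (alpha / r) * B @ A   # what the adapted layer actually computes with

print(W.size, "frozen vs", A.size + B.size, "trainable parameters")
```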

What is the network that sustains this object?

Making a LoRA is like baking a cake, a post by knxo on CivitAI.[4]

The process of LoRA training is very similar to training a model, but at a different scale. Even if it requires dramatically less compute, it still involves the same kind of highly complex technical decisions. In fact, training a LoRA mobilizes the whole network of operations of decentralized image generation and offers a privileged view of its mode of production.

Software dependencies

Various layers of software libraries tame this complexity. A highly skilled user can train a LoRA locally with a series of scripts like kohya_ss and pore over the vertiginous list of options. Platforms like Hugging Face distribute software libraries (e.g. peft) that abstract away the complex integration of the various components, such as LoRAs, in the AI generation pipeline. And for those who don't want to fiddle with code or lack access to a local GPU, the option of training LoRAs is offered by websites such as Runway ML, Eden AI, Hugging Face or CivitAI, with different price schemes.
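
A minimal sketch of what such an abstraction looks like with the peft library, wrapping an illustrative Stable Diffusion text encoder so that only the inserted low-rank weights are trainable; the checkpoint and target module names depend on the model architecture and are assumptions here.

```python
# Minimal sketch: wrapping a model with LoRA adapters via peft (module names are assumptions)
from transformers import CLIPTextModel
from peft import LoraConfig, get_peft_model

# Illustrative choice: the text encoder shipped with Stable Diffusion v1.5
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
)

config = LoraConfig(
    r=8,                                   # rank of the inserted matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which attention projections receive adapters
    lora_dropout=0.05,
)

model = get_peft_model(text_encoder, config)
model.print_trainable_parameters()         # only a small fraction of the weights will train
```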

LoRA as a contact zone between communities with different expertise

"Making a LoRA is like baking a cake," says a widely read tutorial, ' "a lot of preparation, and then letting it bake. If you didn't properly make the preparations, it will probably be inedible."[5] To guide the wannabe LoRA creator in their journey, a wealth of tutorials and documentation in various forms are available from sources such as subreddits, Discord channels, YouTube videos, forums and the platforms that release the code or offer the training and hosting services. They are diverse in tone and they provide varying forms of expertise. A significant portion of this documentation effort consists in code snippets, detailed explanations of parameters and options, bug reports, detailed instructions for the installation of software, tests of hardware compatibility. They are written by professionals, hackers, amateurs, newbies with access to very different infrastructure. Some take for granted unlimited access to compute whilst others make do with a limited local installation. This diversity reflects the position of LoRAs in the AI ecosystem: between expertise and informed amateurism and between resource hungry and consumer grade technology. Whereas foundational model training still remains in the hands of a (happy?) few, LoRA training opens up a perspective of democratization of the means of production for those who have time, persistence and a small capital to invest.

Curation as an operational practice

There is more to LoRAs than the technicalities of installing libraries and training. LoRAs are curated objects. Many tutorials begin with a primer on dataset curation. Fans, artists and amateurs produce an abundant literature on the various questions raised by dataset curation: the identification of sources, the selection of images (criteria of quality, diversity, etc.), the annotation (tagging), and scale (LoRAs can be trained on datasets containing as little as one image and can include collections of thousands of images). As we said above, a user typically decides to embark on the adventure of creating a LoRA because available models fail to generate convincing images for a given subject or style. But they don't start from scratch. They identify the model that best approximates their objective and select images to improve on the perceived lacks. LoRA curators look for precision and nuance rather than quantity. They go to great lengths to identify the most representative visuals for the purpose they have in mind, but they don't do that in the abstract. They identify their samples in response to the existing weaknesses of the model's output.

Remodelling as rewording

Search for the tag dirt_road on the danbooru wiki.[6]

The objective of LoRA curation is to form the learning base for remodelling, not modelling. The importance of that distinction is palpable in the various decisions involved in annotating images in the training set. There are different means of annotating images. To select the right one, the annotator must know how the original model has been trained. For photorealistic images, most models have been annotated with a piece of software called BLIP (which stands for Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation).[7] BLIP produces descriptions in 'natural language', such as "a high resolution photograph of a man sitting on a beach in the moonlight". In the case of LoRAs in anime style, understanding the semantic logic of tagging brings the annotator into the booru universe. Boorus (the word 'board' pronounced in Japanese) are image boards designed to host collections of anime images.[8] Boorus are targets of choice for AI scrapers as they contain huge amounts of images and are frantically annotated by their creators. As knxo aptly notes:

Danbooru: Danbooru style captioning is based in the Booru tagging system and implemented in all NAI derivatives and mixes which accounts for most SD1.5 non photorealistic models. It commonly appears in the following form "1girl, green_eyes, brown_hair, walking, forest, green_dress, eating, burrito, (sauce)". This tagging style is named after the site, as in https://danbooru.donmai.us/. Whenever you have doubt on the meaning of a tag you can navigate to danbooru, search for the tag and open it's wiki.
Take for example the following, search for the tag "road" when we open it's wiki we will see the exact definition as well as derivative tags like street, sidewalk or alley as well as the amount of times the image has been used(13K). In Practice what this means is that the concept is trained to some degree in NAI based models and mixes. The amount of times the tag appears in danbooru actually correlates to the strength of the training(as NAI was directly trained on Danbooru data). So any concept below 500 coincidences are a bit iffy. Keep that in mind when captioning as sometimes it makes more sense to use a generic tag instead of the proper one, for example "road" appears 13k times while "dirt_road" only does so 395 times. In this particular case using dirt_road shouldn't be problematic as "dirt_road" contains "road" anyway and SD is able to see the association.[9]

The LoRA creator's skills include a knowledge of the cultures from which the underlying model has learned: of the vocabulary and syntax, and of the comparative weight given to individual concepts learned by the model. The tagging of the LoRA's dataset mirrors and rewords the tagging of the underlying model. This means that the user gradually develops an acute sense of the model's biases (how it weighs some terms more than others, and excludes or ignores terms). In that context, tagging is an intricate dance with the bias, in order to reverse the problem or work with it. Even if the object of the annotator's effort might seem superficial (adding yet another LoRA for a character that is already featured in hundreds of others), it represents a form of specialized conceptual labour. This testifies to the level of competence in visual culture that is expected from fans and advanced users, and to their ability to think about images beyond immediate representation: more structurally and abstractly, as well as programmatically.
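
To make the two captioning conventions discussed above concrete, here is a small sketch of how a LoRA training set is often laid out: each image gets a sidecar text file containing either a natural-language (BLIP-style) caption or a booru-style tag string. The folder layout and file names are assumptions based on common LoRA training tools, not a universal standard.

```python
# Sketch: writing caption sidecar files for a LoRA training folder (layout is an assumption)
from pathlib import Path

dataset = Path("train/10_dirtroad")      # 'repeats_conceptname' folder naming used by some tools
dataset.mkdir(parents=True, exist_ok=True)

captions = {
    "forest_photo.png": "a high resolution photograph of a dirt road through a pine forest",  # BLIP-style
    "forest_anime.png": "1girl, green_dress, walking, forest, dirt_road, from_behind",        # booru-style tags
}

for image_name, caption in captions.items():
    # The caption lives next to the image, in a .txt file with the same stem
    (dataset / image_name).with_suffix(".txt").write_text(caption)
```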

Negotiating the boundaries of property and identity

Curation involves ethical decisions that are often mentioned in tutorials and posts. Is it OK to scrape from any source? When is appropriation legitimate? The implications of these questions become apparent when users have to take a position for or against making style LoRAs. A style LoRA is a small adapter that can be included in a generative pipeline in order to make images in the style of a given artist. Fans are caught between their desire to engage more deeply with the universe of their favourite authors and their awareness that they might infringe on the authors' rights. In fan communities, reproducing a character is a well-accepted practice. The site DeviantArt features endless crowds of unicorns made by anime lovers. Enabling style mimicry is perceived as going a step further. This time it is to substitute oneself for the author. A dilemma of the same kind occurs when users produce LoRAs that either make it easier to generate realistic images or make it possible to generate a convincing representation of an existing person. The production of deepfakes is often prohibited in the notice that accompanies models. Nevertheless, a quick search on CivitAI reveals an impressive number of LoRAs impersonating actors, politicians and porn stars, which brings the platform to the limit of legality.

Another distribution of labour

The curatorial practice of LoRA creators is very different from the one used to assemble large-scale datasets for foundational models. The curators of huge datasets such as LAION privilege broad scraping and an automatic process. In the case of LoRAs, the creators pick manually and, even when they resort to scraping, they visually monitor the results. Each individual image is validated by the curator. Further, this curatorial practice forms part of an integrated whole. The same person selects the images, writes the captions and trains the LoRA. This again sets the mode of production of LoRAs apart from that of big models, where separate entities deal with the dataset's curation and the training of the model. But we should not simply oppose the craft of LoRA curators to the industrial approach of large dataset creators, as they depend on each other – either for their mere existence (no LoRA without an underlying model), or to gain value (a large model becomes more popular if a LoRA extends its capabilities).

Baking the cake

When the dataset is ready, the training phase begins. Training solicits a different set of skills and knowledge. If some users quickly put together a collection of images and train the LoRA with default parameters while hoping for the best, experienced users delve into a vertiginous list of parameters to 'bake the perfect cake'. To achieve this goal, they must strike a balance between the ideal amount of training and the available material infrastructure, as well as the time during which the machine can be 'monopolized' by the training process. On a personal laptop, better results can be obtained with 24 hours of training. But this means that the machine won't be available for that time, and 24 hours is a long time to wait before checking the results. Training is time management combined with an evaluation of resource availability, which the user must learn to translate into a selection of arcane parameters such as epochs, batch size and steps. The user faces another challenge: the problem of 'overcooking'. A LoRA is 'overcooked' when it reproduces its training content too literally. In computer science jargon, this effect, called overfitting, is correlated with a selection of parameters that makes the model memorize the noise and other irrelevant information from the dataset.[10] When making a LoRA to add the representation of a particular object to the model, an example of overfitting would be that the model not only generates images with this object, but also includes elements of the background of the training images, disrupting the visual consistency of the newly created picture. For many, baking a LoRA is an art more than a science, and the choice of parameters is influenced by reports circulating on an extensive network of platforms that ranges from Reddit and CivitAI to GitHub, Hugging Face and Arxiv. At the training stage, the network extends from the sites that provide images and curatorial resources to a mix of highly technical pages and sites where informal conversation can be had.
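How epochs, batch size and steps relate to training time can be made concrete with a bit of arithmetic. The convention below (total steps = images × repeats × epochs / batch size) follows the way popular community trainers count steps; the repeat count and the seconds-per-step figure are purely illustrative assumptions that depend on the hardware.

```python
import math

def training_budget(num_images: int, repeats: int, epochs: int,
                    batch_size: int, seconds_per_step: float) -> tuple[int, float]:
    """Return (total optimisation steps, estimated wall-clock hours)."""
    steps = math.ceil(num_images * repeats * epochs / batch_size)
    return steps, steps * seconds_per_step / 3600

# e.g. 40 curated images, 10 repeats, 10 epochs, batch size 2,
# ~3 s per step on a consumer GPU (indicative only)
steps, hours = training_budget(40, 10, 10, 2, 3.0)
print(f"{steps} steps, about {hours:.1f} hours of 'monopolized' machine time")
```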

The length of the process combined with its opaque nature imposes many cycles of iteration. The training can stall after a number of steps. It can abruptly end with a sibylline error message. The computer may freeze. The user may run out of credits. Or the results may be sorely disappointing. All these problems trigger search queries, intense readings of manuals, discussions on chats and forums, the writing of desperate X posts and soul-searching blog entries. The user navigates these sites piecing together the information, seeking answers and emotional support. Finding reliable sources of information, they weave a dense network without which they wouldn't manage to make sense of the enigmatic behaviour of their 'oven'. They have to learn to balance the authoritative voices of the authors of the software documentation against the rawer commentaries of those who, like themselves, struggle to make it work. It is not rare to find users who share tips and tutorials expressing different degrees of confidence regarding their choices about loss functions or schedulers, sometimes with a clear admission of powerlessness: "no idea just leave it at 1".

How does it create value? Or decrease / affect value?

LoRAs are definitely objects of interest in the communities of genAI creators. Publishing a good LoRA raises a user's status. It sets them apart to the extent that the small model provides capabilities desired by other fans and creators. It helps to bypass limitations and expand the range of potential representations. Technically, the LoRA creator gains knowledge, and writing tutorials serves to share one's experience but also to attract visibility and legitimize one's status. For a platform such as CivitAI, the creation of LoRAs is encouraged. A dedicated page in the site's education section offers several resources and a glossary.[11] Hosting LoRAs has a clear advantage: it adds value to the base models, whose capabilities are expanded when combined with LoRAs. Essentially, the ever-growing offer allows the platform to reach the specific tastes and interests of variegated groups of users, thereby extending its audience.

Rewards for the best creative use of a LoRA.[12]

Market for LoRAs, the power of demand

The popularity of LoRAs in terms of supply and demand gives a peek into the economy of visibility on the CivitAI platform. Users produce and consume these small models abundantly, to the degree that some of them decry a "mass production".[13] Many transactions on the platform concern LoRAs. Using the platform currency, buzz, many users post bounties [14] where they describe the desired characteristics of a needed LoRA. And LoRA creators offer their services [15] against remuneration or for free. LoRA expertise is sought after. Yet this expertise is not necessarily monetised. Many users ask others for ideas for LoRAs that they subsequently generate for free and share with the whole community. Even if there is no financial return, the user gains visibility and is granted status among their peers. This motivates some users to offer buzz rewards to those who use their LoRA creatively and thereby demonstrate their small model's brilliance and relevance. This testifies to the logic of an economy of visibility where what is sought after is a missing link between large models and user practice. In this gap, the skills of the LoRA creator are worth their salt. But through LoRAs, what the platform and the LoRA makers are trying to reach is the elusive demand. And the more abundant the offer becomes, the more scarce and therefore valuable demand becomes. In a context of overproduction and sharing, of saturation of the imaginary, the LoRA is a device to fill the last remaining gaps. It is also a device to find the last subjects for whom something remains to be desired.

What is its place/role in techno cultural strategies?

Screengrab of the LoRA page on the civit.ai platform
A quick look at the LoRA page on CivitAI gives an idea of the forms of gender representation that dominate the platform. When discussing the site's visuals, a participant in a workshop we gave in the context of Xeno Visual Studies in Madrid blurted out "this is incel culture". And indeed a large portion of the LoRAs feature female characters with large breasts and muscle-bound male heroes. If parodies and critiques of these stereotypes also circulate on the platform, as the opening vignette of this entry demonstrates, they remain limited in number.

While experimenting ourselves with the creation of LoRAs, we wondered how we could begin responding to the lack of diversity in the representations of women in both LoRAs and models. Our small experiment took the form of a LoRA to represent the artist Louise Bourgeois. Indeed, prompting a model such as RealVision with a query such as "The artist Louise Bourgeois working in her studio" resulted in an image of an older woman dressed in contemporary attire with a vague physical likeness to the artist. Assembling a dataset from online and offline images, we ended up with a dozen candidates, which we annotated locally in the Draw Things software. Subsequently we used Draw Things to train a LoRA with RealVision as our base model and explored the results.

Interestingly, with the LoRA, the same model managed to produce a character that resembled her more closely. The facial expression, with the ironic smile characterizing Bourgeois, was present. The general texture of the artworks surrounding Bourgeois was also closer to her work, although they remained more academic in style. Instead of showing her drawing at the studio's table, the model now showed her in contact with the sculptures. Whilst this experiment remained limited and the portraits of Bourgeois were still rather crude, the potential for an engagement with the artist's representation clearly appeared to us. However, this left us with a difficult question. If there is an undeniable production of technical knowledge and skill and (as we develop below) a gain in autonomy from the practice of LoRA creation, how could this potential of technical emancipation be aligned with a different aesthetics and politics of representation?

How does it relate to autonomous infrastructure?

The precondition for the existence of LoRAs is the realisation that base models cannot generate everything, that they are partial and biased. This technique allows users to regain partial control over the training, a part of the technical process that is usually reserved for a handful of companies. In that perspective, the existence of LoRAs evokes the possibility of a re-appropriation of the model via fine-tuning. Even if not complete, this helps users regain some form of autonomy from large model providers, in particular because their needs are defined bottom-up.

As written above, the interest in LoRAs corresponds, for many users, to the realisation that the interaction with AI generators shows some limits. If these limits are first felt at the level of content, the journey of LoRA creation confronts users with another important limitation: the availability of hardware. LoRA training provokes an engagement with the material plane of the genAI ecosystem and its logistics. The materiality of the system becomes palpable either through its costs, or through alternative means of accessing a highly coveted GPU. LoRA creation makes the user aware of the genAI ecosystem's economy, as access to compute is 'linked' to different forms of currency – be it buzz on platforms such as CivitAI, or kudos in networks such as Stable Horde. This being said, fine-tuning techniques benefit from the evolution of material conditions. As the technical requirements become lighter, LoRA production can happen in different environments. In privileged countries, advanced users can train locally on their own machines with consumer-grade GPUs. With a small budget, LoRAs can be trained online. Infrastructurally speaking, edge AI and 'community' platforms are in a position to meet the needs of LoRA training and therefore decentralize the training process a step further. But to date, peer-to-peer networks such as Stable Horde are still limited to inference.

The largest gain is in terms of literacy and understanding of the training process more generally. Indeed, as LoRAs are miniature models, the skills and expertise related to curation, sourcing, annotation and model semantics are being developed through a peer-to-peer effort in communities of amateurs and image makers. This knowledge, which initially pertained to niches of specialists, is being popularized, shared and translated into numerous cultural contexts. If there are still many obstacles to a radical delinking from platforms, there are many encouraging signs that point to a potential convergence between communities and a less centrally controlled infrastructure. LoRA creation might not be the undoing of the centralizing power of hegemonic platforms, far from it. But it can be a step in that direction.

[1] BigHeadTF, The Incredible Hulk (2008) – V1, Civitai, published February 18, 2025, accessed August 18, 2025, https://civitai.com/models/1266100/the-incredible-hulk-2008.

[2] BigHeadTF, The Incredible Hulk (2008) – V1.

[3] Efrat, “LoRA Under the Hood: How It Really Works in Visual Generative AI,” Medium, accessed August 18, 2025, https://medium.com/@efrat_37973/lora-under-the-hood-how-it-really-works-in-visual-generative-ai-e6c10611b461.

[4] knxo, “Making a LoRA Is Like Baking a Cake,” Civitai, published July 10, 2024, accessed August 18, 2025, https://civitai.com/articles/138/making-a-lora-is-like-baking-a-cake.

[5] knxo, “Making a LoRA Is Like Baking a Cake.”

[6] Danbooru, "Dirt Road", Danbooru, accessed August 18, 2025, https://danbooru.donmai.us/posts?tags=dirt_road.

[7] Junnan Li, Khaliq Ahsen, BLIP: Bootstrapping Language-Image Pre-training, GitHub repository, Salesforce, 2022, https://github.com/salesforce/BLIP.

[8] Wikipedia, “Imageboard,” last modified August 2025, accessed August 18, 2025, https://en.wikipedia.org/wiki/Imageboard#Booru.

[9] knxo, “Making a LoRA Is Like Baking a Cake.”

[10] IBM, “What Is Overfitting?” IBM Think, last modified October 15, 2021, accessed August 18, 2025, https://www.ibm.com/think/topics/overfitting.

[11] Civitai, “LoRA Training Glossary,” Civitai Education, accessed August 18, 2025, https://education.civitai.com/lora-training-glossary/.

[12] tehalex86, 5K CrazyWhatever, Civitai, accessed August 18, 2025, https://civitai.com/bounties/8690/5k-crazywhatever.

[13] Stagnation, “The Most Prolific Character LoRAs in Existence by FAR,” Civitai, published February 2, 2024, accessed August 18, 2025, https://civitai.com/articles/3940/the-most-prolific-character-loras-in-existence-by-far.

[14] ColonelJay, "Multiple characters Bounty Results!", Civitai, published August 16, 2025, accessed August 21, 2025, https://civitai.com/articles/17883/multiple-characters-bounty-results.

[15] extrafuzzy, "How I made over $1000 in 3 months selling LoRAs", Civitai, published October 22, 2023, accessed August 21, 2025, https://civitai.com/articles/2684/how-i-made-over-dollar1000-in-3-months-selling-loras.

Mapping 'objects of interest and necessity'

If one considers generative AI as an object, there is also a world of ‘para objects’ surrounding AI, and shaping its reception and interpretation, in the form of maps or diagrams of AI. They are drawn by both amateurs and professionals, who need to represent processes that are otherwise sealed off in technical systems, but more generally reflect a need for abstraction – a need for conceptual models of how generative AI functions. However, as Alfred Korzybski famously put it, one should not confuse the map with the territory: the map is not how reality is, but a representation of reality.[1]

Following on from this, mapping the objects of interest in autonomous AI image creation is not to be understood as a map of what AI 'really is'. Rather, it is a map of encounters with objects; encounters that can be documented and catalogued, but also positioned in a spatial dimension – representing a 'guided tour', and an experience of what objects are called, how they look, how they connect to other objects, communities or underlying infrastructures (see also Objects of interest and necessity). Perhaps the map can even be used by others to navigate autonomous generative AI and create their own experiences? But, importantly, what is particular about the maps of this catalogue of objects of interest and necessity is that they attempt to map autonomous generative AI. In other words, they do not map what is otherwise concealed in, say, OpenAI's DALL-E or Adobe Firefly. In fact, we know very little of how to navigate such proprietary systems, and one might speculate whether there even exists a complete map of their relations and dependencies.

Perhaps because of this lack of overview and insight, maps and cartographies are not just shaping the reception and interpretation of generative AI, but can also be regarded as objects of interest and necessity in themselves – intrinsic parts of AI's existence. Generative AI depends on an abundance of cartography to model, shape, navigate, and also negotiate and criticise its being in the world. There seems to be an inbuilt cultural need to 'map the territory', and the collection of cartographies and maps is therefore also what makes AI a reality – making AI real by externalising its abstraction onto a map, so to speak.

A map of 'objects of interest and necessity' (autonomous AI image generation)

To enter the objects of autonomous AI image generation, a map that separates the territories of ‘pixel space’ from ‘latent space’ can be useful as a starting point – that is, a map that separates the objects you see from those that cannot be seen because they exist in a more abstract, computational space.

Latent space

Latent space is a highly abstract space consisting of compressed representations of images and texts. A key object is the Variational Autoencoder (VAE) that makes the image-texts available to different kinds of operations – whose results are then decoded back into images. An important operation happening in the latent space is the training of an algorithm. In diffusion-based systems, the model is trained by learning to apply noise to an image and then to reconstruct an image from noisy, or even entirely random, input (this process is discussed more in-depth in the entry on diffusion).
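To give a sense of what this training looks like computationally, here is a highly simplified sketch in plain PyTorch: noise is mixed into a (latent) image according to a noise schedule, and a network is trained to predict the noise that was added. The toy model, the schedule and the tensor sizes are placeholders for illustration, not the actual Stable Diffusion architecture.

```python
import torch
import torch.nn as nn

T = 1000                                  # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 8 * 8, 4 * 8 * 8))  # toy denoiser
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

latents = torch.randn(16, 4, 8, 8)        # stand-in for VAE-encoded training images
t = torch.randint(0, T, (16,))            # a random timestep per sample
noise = torch.randn_like(latents)
a = alpha_bar[t].view(-1, 1, 1, 1)
noisy = a.sqrt() * latents + (1 - a).sqrt() * noise   # forward (noising) process

pred = model(noisy).view_as(noise)        # the network tries to predict the added noise
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
opt.step()
```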

To continue our mapping, it is important to note that the latent space is nurtured by various sources. In the process of model training, datasets play a crucial role. Many of the datasets that are used to train models are made by 'scraping' the internet, while others are built on repositories like Instagram, Flickr or Getty Images. Open Images and ImageNet, built from web pages, are commonly used as a backbone for visually training generative AI, but corporate organisations like Meta and Google also offer open source datasets, as do, e.g., research institutions and others. Contrary to common belief, there is not just one dataset used to make a model work, but multiple models and datasets to, for instance, reconstruct missing facial or other bodily details (such as too many fingers on one hand), 'upscale' images of low resolution or 'refine' the details in the image. LoRAs trained on users' own curated datasets are also often used in AI imaging with Stable Diffusion. The latent space is therefore an interpretation of a large pool of visual and textual resources external to it.

When it comes to autonomous AI imaging, there is typically an organisation and a community behind each dataset and training. LAION (Large-scale Artificial Intelligence Open Network) is a good example of this, and a very important one. It is a non-profit community organisation that develops and offers free models and datasets. Stable Diffusion was trained on datasets created by LAION, using Common Crawl (another non-profit organisation that has built a repository of 250 billion web pages) and CLIP (OpenAI's neural network which learns visual concepts from natural language supervision) to compile an extensive record of links to images with 'alt text' (a descriptive text for non-text content, created by 'web masters' for increased accessibility) – that is, a useful set of annotated images to be used for model training. We begin to see that a model's dependencies have large organisational, social and technical ramifications.
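A hedged sketch of the kind of CLIP-based filtering involved in assembling such alt-text datasets: an image and its alt text are both embedded with CLIP, and the pair is kept only if their similarity exceeds a threshold. The checkpoint name is a publicly released CLIP model; the similarity threshold is an illustrative assumption rather than a description of LAION's exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, alt_text: str, threshold: float = 0.28) -> bool:
    """Keep an (image, alt text) pair only if CLIP judges them similar enough."""
    inputs = processor(text=[alt_text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item() > threshold

# keep_pair(Image.open("photo.jpg"), "a dirt road through a pine forest")
```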

(Read more on latent space in its own entry, as well as Variational Autoencoder, VAE)

A diagram of latent space in AI imaging
The map reflects the separation of pixel space from latent space; i.e., what is seen by users from the more abstract space of models and computation, i.e., latent space. It particularly emphasises the objects involved in model training (the stacking of latent spaces), but also how latent space is dependent on an array of organisational, technical, textual and visual resources (by Christian Ulrik Andersen, Nicolas Malevé, and Pablo Velasco).

Pixel space

In pixel space you find a range of visible objects that a typical user would normally meet. This includes the interfaces for creating images. In conventional interfaces like DALL-E or Bing Image Creator, users prompt in order to generate images. What is particular for autonomous and decentralised AI image generation is that the interfaces have many more parameters and ways to interact with the models that generate the images. They function more like 'expert' interfaces.

In pixel space one finds many objects of visual culture. Apart from the interface itself, this includes both all the images generated by AI and all the images used to train the models behind them. These images are, as described above, used to create datasets – compiled by crawling the internet and collections, and scraping images that all belong to different visual cultures (ranging, e.g., from museum collections of paintings to criminal records with mug shots).

Many users also have specific aesthetic requirements for the images they want to generate – say, to generate images in a particular manga style or setting. The expert interfaces therefore also contain the possibility to combine different models and even to post-train one's own models, also known as a LoRA (Low-Rank Adaptation). When sharing the images on platforms like Danbooru (one of the first and largest image boards for manga and anime), images are typically well categorised – both descriptively ('tight boots', 'open mouth', 'red earrings', etc.) and according to visual cultural style ('genshin impact', 'honkai', 'kancolle', etc.). Therefore they can also be used to train more models.

A useful annotated and categorised dataset – be it for a foundation model or a LoRA – typically involves specialised knowledge of both the technical requirements of model training (latent space) and the aesthetics and cultural values of visual culture itself (pixel space); for instance, of common visual conventions such as realism, beauty and horror, and also (in the making of LoRAs) of more specialised conventions such as, say, a visual style that an artist wants to generate (see e.g. the generated images of Danish Hiphop by Kristoffer Ørum[2]).

A diagram of pixel space in AI imaging
A map of AI image generation separating 'pixel space' from 'latent space', but particularly emphasising the objects of pixel space, operated by the users. Pixel space is the home of both conventional visual culture and a more specialised visual culture. Conventionally, image generation will involve simple 'prompt' interfaces, and models will be built on accessible images, scraped from archives on the Internet, for instance. The specialised visual culture takes advantage of the openness of Stable Diffusion to, for instance, generate specific manga or gaming images with advanced settings and parameters. Often, users build and share their own models, too, so-called 'LoRAs' (by Christian Ulrik Andersen, Nicolas Malevé, and Pablo Velasco).

An organisational plane

AI generated images as well as other objects of pixel space and latent space (like software, interfaces, datasets, or models involved in image generation) are not just products of a technical system, but also exist in a social realm, organised on platforms that are driven by the users themselves or commercial companies.

In mainstream visual culture the organisation is structured as a relation between a user and a corporate service. For example, users use OpenAI's DALL-E to generate images, and may also share them afterwards on social media platforms like Meta's Instagram. In this case, the social organisation is more or less controlled by the corporations, who typically allow little interaction between their many users. For instance, DALL-E does not have a feature that allows one to build on or reuse the prompts of other users, or for users to share their experiences and insights with generative AI image creation. Social interaction between users only occurs when they share their images on platforms such as, say, Instagram. Rarely are users involved in the social, legal, technical or other conditions for making and sharing AI generated images, and they have little to say about how these platforms are governed, moderated and censored.

Conversely, on the platforms for generating and sharing images in more autonomous AI, users and communities are deeply involved in the conditions for AI image generation. LAION, again, is a good example of this. It is run by a non-commercial organisation or 'team' of around 20 members, led by Christoph Schuhmann, but their many projects involve a wider community of AI specialists, professionals and researchers. They collaborate on principles of access and openness, and their attempt to 'democratise' AI stands in contrast to the policies of Big Tech AI corporations. In many ways, LAION resembles what the anthropologist Chris Kelty has labelled a 'recursive public' – a community that cares for and self-maintains the means of its own existence.[3]

However, such openness is not to be taken for granted, as also noted in debates around LAION.[4] There are many platforms in the ecology of autonomous AI (see also CivitAI and Hugging Face) that easily become valuable resources. The datasets, models, communities and expertise they offer may therefore also be subject to value extraction. Hugging Face is a prime example of this – a community hub as well as a $4.5 billion company with investments from Amazon, IBM, Google, Intel, and many more, as well as collaborations with Meta and Amazon Web Services. This indicates that in the organisation of autonomous AI there are dependencies not only on communities, but often also on corporate collaboration and venture capital.

A diagram of the organisational plane in AI imaging
The map reflects the organisation of AI image generation. In conventional visual culture, users will access cloud-based services like OpenAI's DALL-E, and may also share their images on social media. In autonomous AI visual culture, the platforms are more democratic in the sense that models or datasets are freely available for training LoRAs or other development. Users also share their own models, datasets, images and knowledge with each other on dedicated platforms, like CivitAI or Hugging Face. Many of the organisations from conventional visual culture (like Meta, who owns Instagram) also invest in the platforms of autonomous AI, and openness is not to be taken for granted (by Christian Ulrik Andersen, Nicolas Malevé, and Pablo Velasco).

A material plane (GPU infrastructure)

Just like the objects of autonomous AI depend on a social organisation (and also on capital and labour, see currencies), they also depend on a material infrastructure – and are, so to speak, always suspended between many different planes. First of all, on hardware and specifically the GPUs that are needed to generate images as well as the models behind. Like in the social organisation of AI image generation, infrastructures too are organised differently.

The mainstream commercial services are set up in what one might call a 'client-server' relation. The users of DALL-E or similar services access a main server (or a 'stack' of servers). Users have little control over the conditions for generating models and images (say, the way models are reused or their climate impact) as this happens elsewhere, in 'the cloud'.

Autonomous AI distinguishes itself from mainstream AI in the decentralised organisation of processing power. Firstly, people who generate images or develop LoRAs with Stable Diffusion can use their own GPU. Often a simple laptop will work, but individuals and communities involved with autonomous AI image creation may also have expensive GPUs with high processing capability (built for gaming). Secondly, there is a decentralised network that connects the community's GPUs. That is, using the so-called Stable Horde (or AI Horde), the community can directly access each other's GPUs in a peer-to-peer manner. Granting others access to one's GPU is rewarded with currencies that in turn can be used to skip the line when waiting to access other members' GPUs. This social organisation of a material infrastructure allows the community to generate images at almost the same speed as commercial services.

To be dependent on the distribution of resources, rather than a centralised resource (e.g., a platform in 'the cloud'), points to how dependencies are often deliberately chosen in autonomous AI. One chooses to be dependent on a community because, for instance, one wants to reduce the consumption of hardware, because it is more cost-effective than one's own GPU, because one cannot afford the commercial services, or simply because one prefers this type of organisation of labour (separated from capital) that offers an alternative to Big Tech. That is, simply because one wants to be autonomous.

At this material plane, there are many other dependencies. For instance, energy consumption, the use of expensive minerals for producing hardware, or the exploitation of labour in the production of hardware.

A diagram of the material plane of AI imaging
The map reflects the material organisation, or infrastructure, of AI image generation. This particularly concerns the use of GPUs and the processing power needed to generate images and train models and LoRAs. In conventional visual culture, users will access cloud-based services (like OpenAI's DALL-E, Adobe Firefly or Microsoft Image Creator) to generate images. In this client-server relation, users do not know where the service is (in the 'cloud'). In autonomous AI visual culture, users benefit from each other's GPUs in Stable/AI Horde's peer-to-peer network – exchanging GPU time for the currency 'kudos'. Knowing the location of the GPU is central. Users also train models, for instance on Hugging Face. Here, the infrastructure more resembles that of a platform (by Christian Ulrik Andersen, Nicolas Malevé, and Pablo Velasco).

Mapping the many different planes and dependencies of generative AI

What is particular about the maps of this catalogue of objects of interest and necessity is that they purely attempt to map autonomous generative AI imaging, serving as a map for a guided tour and experience of autonomous AI. However, both Hugging Face's dependency on venture capital and Stable Diffusion's dependency on hardware and infrastructure point to the fact that there are several planes that are not captured in the above maps – all equally important. For instance, the EU AI Act or laws on copyright infringement, which Stable Diffusion (like any other AI ecology) will also depend on, point to a plane of governance and regulation. AI, including Stable Diffusion, also depends on the organisation of human labour, the extraction of resources, as well as a technical organisation of knowledge. The dependencies on capital should not be forgotten either.

A diagram of the many planes of AI imaging
A map of how objects not only relate pixel space to latent space, but how they are always suspended between different planes - not only a technical one, but also an organisational one, a material one, and potentially many others (capital, labour, knowledge, governance, etc.) (by Christian Ulrik Andersen, Nicolas Malevé, and Pablo Velasco)
A sketch map showing the same as above from collaborative workshop (by Christian Ulrik Andersen, Nicolas Malevé, and Pablo Velasco)

In mapping the objects of interest and necessity, we attempt to describe how Stable Diffusion and autonomous AI image generation depend on these different planes, but an overview of the many planes of AI and how it 'stacks' can of course also be the centre of a map in itself. One example of this is Kate Crawford's Atlas of AI, a book that displays different maps (and also images) that link AI to 'Earth' and the exploitation of energy and minerals, or 'Labour' and the workers who do micro tasks ('clicking' tasks) or the workers in Amazon's warehouses. Additionally, Crawford's book has chapters on 'Data', 'Classification', 'Affect', 'State' and 'Power'.

Another abstraction of the layered nature of generative AI is found in Gertraud Koch's map of all the layers that she and her coauthors connect to "technological activity", which is clearly relevant to AI.[5] On top of a layer of technology (the 'data models and algorithms') one will find other layers that are interdependent, and which contribute to the political and technological qualities of AI. As such, the map is also meant for navigation – to identify starting points for rethinking its concepts or reimagining alternative futures (in their work, particularly in relation to a potential delinking from a colonial past, and reimagining a pluriversality of technology).

Five sets of technological activities. The layers wrap around technology as a material entity with its own agency, but the layers are permeable and interdependent (by Gertraud Koch).

Within the many planes and stacks of AI one can find many different maps that build other types of overviews and conceptual models of AI – perhaps pointing to how maps themselves take part in making AI a reality.

The corporate landscape

The entrepreneur, investor and podcast host Matt Turck has made the "ultimate annual market map of the data/AI industry". Since 2012 he has documented the corporate landscape of AI not just to identify key corporate actors, but also to track developing trends in business. As he notes in his blog, the first map from 2012 has merely 139 logos, whereas the 2024 version has 2,011 logos.[6] This reflects the massive investment in AI entrepreneurship, following first 'big data' and now 'generative AI' (and machine learning) – how AI has become a business reality. Comparing the 2012 version with the most recent map from 2024, one can see how the corporate landscape of AI changes over time – how, for instance, the division of companies dealing with infrastructure, data analytics, applications, data sources and open source AI becomes more fine-grained over the years, forking out into applications in health, finance and agriculture; or how privacy and security become of increased concern. Clearly, AI reconfigures and intersects with many different realities.
The corporate landscape of Big Data in 2012 (by Matt Turck and Shivon Zilis).[7]

Critical cartography in the mapping of AI

In mapping AI, there are also 'counter maps' or 'critical cartography'.[8] Conventional world maps are built on set principles of, for instance, North facing up and Europe at the centre. The map is therefore not just a map for navigation, but also a map of more abstract imaginaries and histories originating in colonial times, where maps took Europe as their point of departure and were an intrinsic part of the conquest of territories. In this sense, a map always also reflects hierarchies of power and control that can be inverted or exposed (for instance by turning the map upside down, letting the South be a point of departure). Counter-mapping technological territories would, following this logic, involve what the French research and design group Bureau d'Études has called "maps of contemporary political, social and economic systems that allow people to inform, reposition and empower themselves."[9] They are maps that reveal underlying structures of social, political or economic dependencies to expose what ought to be of common interest, or the hidden grounds on which a commons rests. Gilles Deleuze and Félix Guattari's notion of 'deterritorialization' can be useful here, as a way to conceptualise the practices that expose and mutate the social, material, financial, political, or other organisation of relations and dependencies.[10] The aim is not only to destroy this 'territory' of relations and dependencies, but ultimately a 'reterritorialization' – a reconfiguration of the relations and dependencies.

Utilising the opportunities of info-graphics in mapping can be powerful. At the plane of financial dependencies, one can map, as Matt Turck does, the corporate landscape of AI, but one can also draw a different map that reveals how the territory of 'startups' does not compare to a geographical map of land and continents. Strikingly, the United States is double the size of Europe and Asia, whereas whole countries and continents (such as Africa) are missing. This map thereby not only reflects the number of startups, but also how venture capital is dependent on other planes, such as politics and the organisation of capital, or infrastructural gaps. In Africa, for instance, the AI divide is very much also a 'digital divide', as argued by AI researcher Jean-Louis Fendji.[11]

Numbers of newly funded AI startups per country (by Visual Capitalist).[12]
Counter-mapping the organisation of relations and dependencies is also prevalent in the works of the Barcelona-based artist collective Estampa, which exposes how generative AI depends on different planes: venture capital, energy consumption, a supply chain of minerals, human labour, as well as other infrastructures, such as the internet, which is 'scraped' for images or other media.
Map of generative AI (by Taller Estampa).[13]

Epistemic mapping of AI

Maps of AI often also address how AI functions as what Celia Lury has called an 'epistemic infrastructure'.[14] That is, AI is an apparatus that builds on knowledge, creates knowledge, but also shapes what knowledge is and what we consider to be knowledge. To Lury, the question of 'methods' here becomes central – not as a neutral, 'objective' stance, as one typically regards good methodology in science, but as a cultural and social practice that helps articulate the questions we ask and what we consider to be a problem in the first place. When one, for instance, criticises the social, racial or other biases in generative AI (such as all doctors being white males in generative AI image creation), we are not just dealing with bias in the dataset that can be fixed with 'negative prompts' or other technical means. Rather, AI is fundamentally – in its very construction and infrastructure – based in a Eurocentric history of modernity and knowledge production. For instance, as pointed out by Rachel Adams, AI belongs to a genealogy of intelligence, and one also ought to ask whose intelligence and understanding of knowledge is modelled within the technology – and whose is left out?[15]

There are several attempts to map this territory in the plane of knowledge production, and its many social, material, political or other relations and dependencies. Sharing many of the concerns of Lury and Adams, Vladan Joler and Matteo Pasquinelli's 'Nooscope' is a good example of this.[16] In their understanding, AI belongs to a much longer history of knowledge instruments ('nooscopes', from the Greek skopein, 'to examine, look', and noos, 'knowledge') that would also include optical instruments, but which in AI takes the form of a magnification of patterns and statistical correlations in data. The Nooscope map is an abstraction of how AI functions as an "Instrument of Knowledge Extractivism". It is therefore not a map of 'intelligence' and logical reasoning, but rather of a "regime of visibility and intelligibility" whose aim is the automation of labour, and of how this aim rests (as other capitalist extractions of value in modernity do) on a division of labour – between humans and technology, between, for instance, historical biases in the selection and labelling of data and their formalisation in sensors, databases and metadata. The map also refers to how selection, labelling and other laborious tasks in the training of models are done by "ghost workers", thereby referring to a broader geo-politics and body-politics of AI where human labour is often done by subjects of the Global South (although they might oppose being referred to as 'ghosts').

A map of AI as an instrument of knowledge by Vladan Joler and Matteo Pasquinelli (2020)
A map of AI as an instrument of knowledge (by Vladan Joler and Matteo Pasquinelli).


[1] Korzybski, Alfred. “A Non-Aristotelian System and Its Necessity for Rigour in Mathematics and Physics.” In Science and Sanity: An Introduction to Non-Aristotelian Systems and General Semantics. Lancaster, 1933.

[2] Kristoffer Ørum, “Project #253,” accessed August 11, 2025, https://oerum.org/pico/projects/253.

[3] Christopher M. Kelty, Two Bits: The Cultural Significance of Free Software (Durham, NC: Duke University Press, 2008), https://read.dukeupress.edu/books/book/1136/Two-BitsThe-Cultural-Significance-of-Free-Software.

[4] “The Story of LAION: The Dataset Behind Stable Diffusion,” The Batch, June 7, 2023, https://www.deeplearning.ai/the-batch/the-story-of-laion-the-dataset-behind-stable-diffusion/.

[5] Gertraud Koch et al., “Layers of Technology in Pluriversal Design: Decolonising Language Technology with the Live Language Initiative,” CoDesign 20, no. 1 (2024): 77–90, https://doi.org/10.1080/15710882.2024.2341799.

[6] Matt Turck, “Full Steam Ahead: The 2024 MAD (Machine Learning, AI & Data) Landscape,” MattTurck.com, March 2024, https://mattturck.com/MAD2024/.

[7] Matt Turck, “A Chart of the Big Data Ecosystem,” MattTurck.com, June 29, 2012, https://mattturck.com/a-chart-of-the-big-data-ecosystem/.

[8] Jeremy W. Crampton and John Krygier, “An Introduction to Critical Cartography,” ACME: An International Journal for Critical Geographies 4, no. 1 (2005): 11–33, https://doi.org/10.14288/acme.v4i1.723.

[9] Bureau d'Études, accessed August 11, 2025, https://bureaudetudes.org/.

[10] Gilles Deleuze and Félix Guattari, A Thousand Plateaus: Capitalism and Schizophrenia, trans. Brian Massumi (Minneapolis: University of Minnesota Press, 1987).

[11] “From Digital Divide to AI Divide – Fellows’ Seminar by Jean-Louis Fendji,” STIAS, April 9, 2024, https://stias.ac.za/2024/04/from-digital-divide-to-ai-divide-fellows-seminar-by-jean-louis-fendji/.

[12] Marcus Lu, “Mapped: The Number of AI Startups by Country,” Visual Capitalist, May 6, 2024, https://www.visualcapitalist.com/mapped-the-number-of-ai-startups-by-country/.

[13] “Cartography of Generative AI,” Estampa, accessed August 11, 2025, https://tallerestampa.com/en/estampa/cartography-of-generative-ai/.

[14] Celia Lury, Problem Spaces: How and Why Methodology Matters (Cambridge, UK; Medford, MA: Polity, 2021).

[15] Rachel Adams, “Can Artificial Intelligence Be Decolonized?,” Area 53, no. 1 (2021): 6–13, https://doi.org/10.1080/03080188.2020.1840225.

[16] Vladan Joler and Matteo Pasquinelli, The Nooscope Manifested: AI as Instrument of Knowledge Extractivism, 2020, https://fritz.ai/nooscope/.

Model card

As models begin to pile up in open repositories like Hugging Face, model cards have emerged as a privileged means to document them.[1] Think about model cards as nutrition labels for models. Ideally, they list the model's ingredients, how it was trained and its validation procedures, as well as its intended use and limitations. Whilst code repositories cannot force their use upon the users, they automatically create an empty model card when a new model is uploaded, in an effort to encourage standardization and transparency.[2] However, examining how model cards are written, one can see that their content varies a lot. Sometimes they thoroughly document the model with a reference to an academic paper, sometimes they offer only minimal information or are simply empty. In that, model cards testify to the diverse nature of model providers. Some work in computer science labs or in companies; others are amateurs with sometimes little time left for the tedious work of documentation, or simply no desire to share their production widely. Finally, an empty model card doesn't necessarily mean absence of documentation. Users may find it more appealing to document their models in other forms. On CivitAI, a platform where manga fans share their models and LoRAs, each model is introduced with a succinct description written in a more affective tone, where the author explains their goal, cracks a joke, begs for a tip on their Patreon and thanks their network of collaborators as well as the models and resources they are building on.
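For readers who want to inspect this variety programmatically, the huggingface_hub library exposes a repository's model card as an object with structured metadata and a free-text body. A minimal sketch, assuming a card exists for the given repository id (the id below is only an example):

```python
from huggingface_hub import ModelCard

# Load the README/model card of a repository on the Hugging Face Hub.
card = ModelCard.load("stabilityai/stable-diffusion-2-1")

print(card.data)          # structured YAML metadata: license, tags, datasets, ...
print(card.text[:500])    # the free-text body, which may be thorough or nearly empty
```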

Examples of model cards:

A model card made by a community of manga users
A model card made by the company Black Forest Labs
An empty model card


[1] Liang, Weixin, Nazneen Rajani, Xinyu Yang, et al. “What’s Documented in AI? Systematic Analysis of 32K AI Model Cards.” 2024. https://arxiv.org/abs/2402.05160.

[2] Mitchell, Margaret, Simone Wu, Andrew Zaldivar, et al. “Model Cards for Model Reporting.” Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, ACM, January 2019, 220–29. https://doi.org/10.1145/3287560.3287596.

Pixel space

In pixel space, you find a range of visible objects that a typical user would normally meet. This includes the interfaces for creating images. In conventional interfaces like DALL-E or Bing Image Creator, users prompt in order to generate images. What is particular for autonomous and decentralised AI image generation is that the interfaces have many more parameters and ways to interact with the models that generate the images. They function more like 'expert' interfaces.

In pixel space one finds many objects of visual culture. Apart from the interface itself, this includes both all the images generated by AI and all the images used to train the models behind them. These images are, as described above, used to create datasets, compiled by crawling the internet and scraping images that all belong to different visual cultures – ranging, e.g., from museum collections of paintings to criminal records with mug shots.

Many users also have specific aesthetic requirements for the images they want to generate – say, to generate images in a particular manga style or setting. The expert interface therefore also contains the possibility to combine different models and even to post-train one's own models, also known as a LoRA (Low-Rank Adaptation). When sharing the images on platforms like Danbooru (one of the first and largest image boards for manga and anime), images are typically well categorised – both descriptively ('tight boots', 'open mouth', 'red earrings', etc.) and according to visual cultural style ('genshin impact', 'honkai', 'kancolle', etc.). Therefore they can also be used to train more models.

A useful annotated and categorised dataset – be it for a foundation model or a LoRA – typically involves specialised knowledge of both the technical requirements of model training (latent space) and the aesthetics and cultural values of visual culture itself; for instance, of common visual conventions such as realism, beauty and horror, and also (in the making of LoRAs) of more specialised conventions – say, a visual style that an artist or a cultural community wants to generate.

A diagram of pixel space in AI imaging
A map of AI image generation separating 'pixel space' from 'latent space', but particularly emphasising the objects of pixel space, operated by the users. Pixel space is the home of both conventional visual culture and a more specialised visual culture. Conventionally, image generation will involve simple 'prompt' interfaces, and models will be built on accessible images, scraped from archives on the Internet, for instance. The specialised visual culture takes advantage of the openness of Stable Diffusion to, for instance, generate specific manga or gaming images with advanced settings and parameters. Often, users build and share their own models, too, so-called 'LoRAs' (by Christian Ulrik Andersen, Nicolas Malevé, and Pablo Velasco).

Prompt

In a nutshell, a prompt is a string of words meant to guide an image generator in the creation of an image. In our guided tour, this object belongs to what we call pixel space. In theory, a user who writes a prompt can ignore the whole machinery involved in the image generation process. But, as we will see, in reality prompting also resonates with other layers of the AI image generation ecosystem.

What is the network that sustains this object?

Prompts can be shared or kept private. But a search on prompting in any search engine yields an impressive number of results. Among AI image creators, a prompt is an object of exchange as well as a source of inspiration and a means of reproduction. There is an economy of sharing for prompts that encompasses lists of best prompts, tutorials and demos. On CivitAI, for instance, users post images together with the prompt they used to generate them, encouraging others to try them out.

How does it evolve through time?

"This avocado armchair could be the future of AI", writes the MIT Technology Review in January 2021 [1]

Technically, the authors of the Weskill blog identify the year 2017 as a watershed moment that came with the "Attention Is All You Need" paper, which introduced the transformer architecture, and with the zero-shot prompting technique: "You supply only the instruction, relying on the model’s pre‑training to handle the task."[2] The complexity of generating text or images is abstracted away from the user and supported by a huge computing infrastructure operating behind the scenes. In that, prompt-based generators break from previous experiments with GANs, which remained confined to a technically skilled audience, and hold a promise of democratization in terms of accessibility and simplicity. They also benefit from the evolution of text generation models. For instance, prompting in the recent Flux models differs from the more rudimentary Stable Diffusion, as they integrate the advances in LLMs to add a more refined semantic understanding of the prompt. For the users, this translates into an evolution from a form of writing that consisted of a list of tags to descriptions in natural language.

How does it create value? Or decrease / affect value?

Prompt adherence (a model's ability to respond to prompts coherently) is a major criterion for evaluating its quality. Yet prompt adherence can be understood in different ways. The interfaces designed for prompting oscillate between two opposite poles reflecting diverging tendencies in what users value in AI systems: a desire for simplicity ('just type a few words and the magic happens') and a desire for control. In the first case, the philosophy is to ask as little as possible from the user. For example, early ChatGPT offered two parameters through the API: prompt and temperature. With only a few words, the user gets what they supposedly want, and the system has to make up for all the bits that are missing. This apparent simplicity involves 'prompt augmentation' (the automatic amplification of the prompt) on the server side, as well as a lot of implicit assumptions. At the opposite end of the spectrum, in interfaces like ArtBot, the prompt is surrounded by a vertiginous list of options where the user must make every choice explicitly. Here, the user is treated as an expert. Prompt expansion is visible to them, providing tools to improve the prompt and offer context.
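The two poles can be illustrated with the diffusers library, where the same call can be made with almost nothing specified or with every choice made explicit. A rough sketch, assuming a Stable Diffusion 1.5 checkpoint is available locally or on the Hugging Face Hub; the prompts and parameter values are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

# Pole 1: 'just type a few words and the magic happens' -- every other choice is implicit.
image = pipe("a cosy reading corner").images[0]

# Pole 2: the 'expert' interface -- the user makes each choice explicitly.
image = pipe(
    prompt="a cosy reading corner, soft morning light, 35mm photograph",
    negative_prompt="blurry, lowres, text, watermark",
    guidance_scale=7.5,                            # how strongly to adhere to the prompt
    num_inference_steps=30,                        # number of denoising steps
    generator=torch.Generator().manual_seed(42),   # reproducible result
).images[0]
image.save("reading_corner.png")
```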

What is its place/role in techno cultural strategies?

By interpreting the prompt, the system supposedly 'reads' what is in the user's mind. Interpreting the prompt involves much more than a literal translation of a string of words into pixels. It is the interpretation of the meaning of these words, with all the cultural complexity this entails. As prompts were historically limited in size, this work of interpretation was performed on the basis of a very minimal description, often with a syntax reduced to a comma-separated list or a string of tags. Even now, with extended descriptions, the model is still tasked with filling in the blanks. As the model tries to make do, it inevitably reveals its own bias. If a prompt mentions an interior, the model generates an image of a house that reflects the dominant trends in its training data. Prompting is therefore half ideation and half search: the visualisation of an idea (what the user wants to see) and the visualisation of the model's worldview. When prompting regularly, a user understands that each model has its own singularity. The model is trained with particular data. Further, it synthesizes these data in its own way. Through prompting, the user gradually develops a feel for the model's singularity. They elaborate semantics to work around perceived limitations and devise targeted keywords to which a particular model responds.

One strategy adopted by users who want to expose a model's limitations, faults or biases is what we could call 'reflexive prompting'. For instance, here are two images generated with the model EpicRealism, with a prompt which inverts the traditional gender roles and asks for the picture of a man washing the dishes. The surreal results testify to the degree to which the model has internalized the division of labour in the household.

A more elaborate example of reflexive prompting is discussed in A Sign That Spells, where Fabian Offert and Thao Phan analyze an experiment made by Andy Baio after the announcement of a new DALL-E release that was intended to fix many thorny issues, including racial bias.[3] Baio prompted DALL-E with an incomplete sentence such as "a sign that spells", without specifying what was to be spelled. The image generator produced images of people holding signs with words such as "woman", "black", "Africa", etc. The experiment demonstrated that the new DALL-E release consisted of a 'quick fix': instead of fixing the model's bias, DALL-E simply appended words to the user's prompt to orient the result towards a more 'diverse' output. As Offert and Phan put it, Baio's experiment revealed that OpenAI was not improving the model, but instead fixing the user.

A Sign That Spells: DALL·E 2, Invisual Images and The Racial Politics of Feature Space, by Fabian Offert and Thao Phan

This experiment testifies to the fact that nobody prompts alone. Prompts are rarely interpreted directly by the model. They go through a series of checks before being rendered. They are treated as sensitive and potentially offensive. This has motivated different forms of censorship by mainstream platforms, and in return it has propelled the development of many strategies aiming at gaming the model. In a workshop conducted by the artist Ada Ada Ada, they explained the poetic trick devised by users trying to bypass Midjourney's censorship in order to generate an image of two vikings kissing on the mouth.[4] After many unsuccessful attempts, they circumvented the censorship mechanism with the prompt "The viking telling a secret in the mouth of another".

How does it relate to autonomous infrastructure?

One strong motivation to adopt an autonomous infrastructure is avoiding censorship. Even if models are carefully trained to stay in the fold, working with a local model allows one to do away with many layers of platform censorship. For better or for worse, prompting the model locally means deciding individually what to censor. Interestingly, in distributed infrastructures such as Stable Horde, censorship is not absent. The developers go to great lengths to prevent prompts that would generate CSAM content.[5] Indeed, the question of defending one's values while remaining open to a plural use of the platform remains a difficult techno-political issue even in alternative platforms. And the prompt is one of the components of the image generation pipeline that reveals this issue most directly.


[1] Heaven, Will Douglas. “This Avocado Armchair Could Be the Future of AI.” MIT Technology Review, January 5, 2021. https://www.technologyreview.com/2021/01/05/1015754/avocado-armchair-future-ai-openai-deep-learning-nlp-gpt3-computer-vision-common-sense/.

[2] Weskill. “History and Evolution of Prompt Engineering.” Weskill, April 23, 2025. https://blog.weskill.org/2025/04/history-and-evolution-of-prompt.html.

[3] Offert, Fabian, and Thao Phan. “A Sign That Spells: DALL-E 2, Invisual Images and The Racial Politics of Feature Space.” 2022. https://arxiv.org/abs/2211.06323.

[4] AIIM. “AIIM Workshop: Ada Ada Ada.” AIIM - Centre for Aesthetics of AI Images, June 1, 2025. https://cc.au.dk/en/aiim/events/view/artikel/aiim-workshop-ada-ada-ada.

[5] DB0. “AI-Powered Anti-CSAM Filter for Stable Diffusion.” A Division by Zer0, October 3, 2023. https://dbzer0.com/blog/ai-powered-anti-csam-filter-for-stable-diffusion/.

Horde AI or Stable Horde is a distributed cluster of GPUs. The project describes itself as a "volunteer crowd-sourced distributed cluster of image and text generation workers".[1] This translates into a network of individual GPU users who 'lend' their devices and stored large language models. This means that one can generate an image from any device connected to this network through an interface, e.g. a website accessed through a phone. While the visible effects are the same as using ChatGPT, Microsoft Copilot or any other proprietary service, the images in this network are 'community generated'. The request is not sent to a server farm or a company, but to a user who is willing to share their GPU power and stored models. Haidra, the non-profit organisation associated with HordeAI, seeks to make AI free, open-source and collaborative, effectively circumventing the reliance on AI big-tech players.

Projects like Stable Horde/HordeAI offer a glimpse into the possibilities of autonomy in the world of image generation, and offer other ways of volunteering through technical means. In a way, the project inherits some of the ethos of P2P sharing and recursive publics, updated for the world of LLMs. The GPU used in this project is (intermittently) part of the HordeAI network, generating and using the kudos currency.
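
How a request travels through the horde can be sketched with the project's public REST API. The endpoint paths, the anonymous API key and the response fields below follow the documentation as we read it and should be treated as assumptions; rate limits and kudos costs apply in practice.

 import time
 import requests

 API = "https://stablehorde.net/api/v2"
 HEADERS = {"apikey": "0000000000"}  # anonymous key documented by the project

 # Submit an asynchronous generation request: a volunteer worker in the
 # network picks it up, not a central server farm.
 job = requests.post(
     f"{API}/generate/async",
     headers=HEADERS,
     json={"prompt": "a horde of graphic cards", "params": {"n": 1}},
 ).json()

 # Poll until a worker has finished, then print the resulting image link.
 while True:
     status = requests.get(f"{API}/generate/status/{job['id']}").json()
     if status.get("done"):
         print(status["generations"][0]["img"])
         break
     time.sleep(5)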

See: GPU: A horde of graphic cards

See: Currencies: GPU as currency

[1] Stable Horde. Stable Horde. Accessed August 26, 2025. https://stablehorde.net/.

Variational Autoencoder, VAE

With the variational autoencoder (VAE), we enter the nitty-gritty part of our guided tour and the intricacies of the generation process. Let's get back to an alternative version of the map of latent space, found in the maps entry.

A diagram of latent space in AI imaging
The map reflects the separation of pixel space, i.e. what is seen by users, from latent space, the more abstract space of models and computation. (By Christian Ulrik Andersen, Nicolas Malevé, and Pablo Velasco)

There are various possible inputs to Stable Diffusion models. As represented above, the best known are text and images. When a user supplies an image and a prompt, they are not sent directly to the diffusion algorithm per se (the grey area in the diagram). They are first encoded into meaningful variables. The encoding of text is often performed by an encoder named CLIP, and the encoding of images is carried out by a variational autoencoder (VAE). As the diagram shows, many operations take place inside the grey area. This is where the work of generation proper happens. Inside this area, the latent space, the operations are not made directly on pixels, but on lighter statistical representations called latents. Before leaving that area, the generated images go once more through a VAE. This time the VAE acts as a decoder and is responsible for translating the result of the diffusion process, the latents, back into pixels. Encoding into latents and decoding back into pixels, VAEs are, as Efrat Taig puts it, bridges between spaces, between pixel and latent space.[1]
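
The bridging role of the VAE can be sketched with diffusers' standalone autoencoder class. The checkpoint name, the 512x512 input size and the 0.18215 scaling constant are the commonly used defaults for Stable Diffusion 1.x and should be read as assumptions; "portrait.png" stands in for any local image.

 import torch
 from diffusers import AutoencoderKL
 from diffusers.utils import load_image
 from torchvision import transforms

 # A standalone VAE of the kind used by Stable Diffusion 1.x.
 vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

 # Pixel space -> latent space: a 512x512 RGB image becomes a much
 # smaller 4x64x64 tensor of latents.
 img = load_image("portrait.png").resize((512, 512))
 x = transforms.ToTensor()(img).unsqueeze(0) * 2 - 1      # scale to [-1, 1]
 with torch.no_grad():
     latents = vae.encode(x).latent_dist.sample() * 0.18215

 # ... the diffusion process operates here, on latents, not pixels ...

 # Latent space -> pixel space: the decoder translates the latents back,
 # and it is at this step that fine details can be lost.
 with torch.no_grad():
     pixels = vae.decode(latents / 0.18215).sample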

Encoders and the loss of details, by rMada [2]
On platforms such as CivitAI, users train and share bespoke VAEs. As in the case of LoRAs, VAEs are components that users can act upon in order to improve the behaviour of a model, make it fit their needs and increase the value of their creations. The user rMada likens the effects of a VAE to overexposure or underexposure in photography. When a VAE is too coarse, it results in a loss of detail. Photographers, rMada says, "use the histogram to avoid such loss".[3] Indeed, creators of synthetic images have understood very well the importance of such components of the generation pipeline. They realize how these affect one of the most sought-after qualities of an image: its sense of realism, achieved by incorporating as many detailed features as possible in, say, a portrait.
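
Swapping a bespoke VAE into an existing pipeline is itself a small operation, which helps explain how actively these components circulate. The sketch below uses diffusers; the checkpoint names are illustrative stand-ins, not specific files shared on CivitAI.

 import torch
 from diffusers import AutoencoderKL, StableDiffusionPipeline

 pipe = StableDiffusionPipeline.from_pretrained(
     "runwayml/stable-diffusion-v1-5",    # illustrative base model
     torch_dtype=torch.float16,
 )

 # Replace the pipeline's default VAE with a community-trained one,
 # e.g. a checkpoint downloaded and converted to diffusers format.
 pipe.vae = AutoencoderKL.from_pretrained(
     "stabilityai/sd-vae-ft-mse",         # stand-in for a bespoke VAE
     torch_dtype=torch.float16,
 )

 pipe = pipe.to("cuda")
 image = pipe("a detailed portrait, natural light").images[0]
 image.save("portrait_custom_vae.png")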

The active exchange of VAEs on genAI platforms testifies to the flexibility of the image generation pipeline when it is open source. The likes and attention that VAEs gather show how they act as currencies among communities of image creators. Their constant refinement also testifies to the platforms' success in turning the enthusiasm of amateurs into technical expertise.


[1] Taig, Efrat. “VAE. The Latent Bottleneck: Why Image Generation Processes Lose Fine Details.” Medium, May 31, 2025. https://medium.com/@efrat_37973/vae-the-latent-bottleneck-why-image-generation-processes-lose-fine-details-a056dcd6015e.

[2] rMada. “VAE RAW to Obtain Greater Detail.” CivitAI, June 17, 2023. https://civitai.com/articles/462/vae-raw-to-obtain-greater-detail.

[3] rMada, “VAE RAW to Obtain Greater Detail.”
