Card texts
// These are the texts for the A4 + A3 printouts – approximately 1,300 characters each
// A4: All of the objects
// A3: Objects/Maps – Pixel Space/Latent space – Prompt/LoRA – GPU/Currency – Stable Horde/Hugging Face
// Pair the texts with the icons for A4 printouts
CivitAI [1,371 characters]
A4 text:
CivitAI is a collaborative hub for AI imaging. It attracts a huge number of enthusiasts, who share their productions in image galleries, and celebrates the capabilities of generative AI to clone and reimagine every style and cross every genre, from cartoon to oil painting and fashion photography to extreme pornography, as if the users were on a mission to exhaust the pixel space.
Users also upload custom-made, fine-tuned models (LoRAs and VAEs) along with detailed tutorials. A large population of anime fans is responsible for an endless list of models that specialise in a given manga character, for instance. This makes CivitAI a bridge between fans and computer geeks (sometimes both) who enjoy the platform's lax moderation, which unfortunately does little to prevent various forms of abuse.
Many models are available on the platform's own image generator as well as for free download, therefore finding their way to private desktops as well as peer-to-peer networks, such as Stable Horde.
CivitAI supports its operations through various commercial offers and venture capital. All in all, the company exemplifies the paradoxical attachments of autonomous AI: it indeed serves the bottom-up production and 'democratisation' of AI technology in a way that goes beyond mere consumer usage, but it does so by converting the care and labour of a large population of enthusiasts into capital.
CLIP [1,372 characters]
A4 text:
CLIP (contrastive language-image pre-training) is largely unknown to the general public. Like the VAE, it is used to encode input into embeddings – statistical representations that can be operated upon in the latent space, for instance to retrieve or classify images.
Its use as a secondary component in the image generation pipeline by LAION and others shows the composite nature of these architectures, where existing elements are borrowed from different sources and repurposed according to needs. CLIP was first released in 2021 by OpenAI under an open source license, just before the company changed its politics of openness. Subsequent products such as DALL-E are governed by a proprietary license. If, technically, CLIP bridges prompts and the latent space, politically it travels between proprietary and open source ecosystems.
If we compare a component like CLIP to another like the VAE, we see how elements that perform similar technical functions allow for strikingly different forms of social appropriation. Amateurs train and retrain VAEs to improve image realism, whereas CLIP, which has been trained on four hundred million image-text pairs, cannot be retrained without incurring exorbitant costs. Therefore, the overwhelming presence of CLIP is due to its open licensing rather than its flexibility. The sheer cost of its production makes it a black box even for advanced users and puts its inspection and customisation out of reach.
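To make the encoding step concrete, here is a minimal sketch of using CLIP to score how well a set of captions matches an image, via the Hugging Face transformers wrappers. The checkpoint name is the publicly released CLIP base model; the image file and captions are placeholders.

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Load the publicly released CLIP weights and their pre-processing.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")  # placeholder image file
    captions = ["a cat on a sofa", "an oil painting of a harbour", "a circuit board"]

    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds the image-text similarity; softmax turns it into a
    # rough probability over the candidate captions.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    for caption, p in zip(captions, probs):
        print(f"{float(p):.2f}  {caption}")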
Currencies [1,288 characters]
A4 text:
While a familiar object, currencies can take wild shapes. Traditionally, a currency is a material medium of exchange (commonly, metal or paper): coins and bills are still a common form of currency. Digital platforms and services, however, have multiplied the forms these exchange objects can take. Videogames and blockchain technology have helped to explode what can be understood as currency, and have subverted the dependence on larger organisations: if traditionally states, banks, and large organisations designed and managed these objects, their digital counterparts tend to be more untamed and sometimes community-orientated. Autonomous currencies have always existed as exchange modes for small communities or as explicit counter-objects to hegemonic economies, but their digital versions are easy to set up, distribute, and share with communities around the world. Within our objects of interest, digital currencies make it possible to directly generate images, or to commodify the possibility of new ones (for example, as a bounty for those creating or fine-tuning models). Immaterial versions of currencies also act as the exchange medium for sharing non-hegemonic networks of image generation, adding value and circulation to a larger infrastructure of imaginaries of autonomy.
A3 texts:
Body texts:
Most currencies are based on a network of interest that agrees to assign value to an object or system.
While we use fiat currencies (euros, crowns, etc.) for most of our daily lives, social structures work on a sometimes invisible and highly complex mesh of systems of value and exchange.
NVIDIA's market capitalization
With the expansion of LLMs and AI-orientated platforms, scarcity has moved towards hardware capable of training, operating, and fine-tuning LLMs. That has made the GPU a holy grail of hardware.
Buzz", the currency for CivitAI (one of the largest marketplaces for generative AI content), acts as a reward for the user's interaction with content, as a tip for content creators, and even as a 'bounty' for specific requests. One can even beg for currency.
The economies of LLMs are not restricted to big tech. HordeAI acts as a barter system of sorts, with its own currency named 'kudos'. It encourages users to lend their graphics hardware to produce images for someone else within the network.
The installation of a node of Stable AI in a Xenofeminist workshop in Madrid.
Protocol as currency
Instead of relying on a central, trusted institution like a bank or a government, Bitcoin offloaded trust and accountability to a mathematically governed, distributed system. Technically, the system guarantees accurate transactions between any parties, without any central management.
Bottom text
Same as A4 text above
Datasets [1,279 characters]
A4 text:
In the context of AI image generation, a dataset is a collection of image-text pairs (and sometimes other attributes such as provenance or an aesthetic score) used to train AI models. It is an object of necessity par excellence. Without datasets, no model could see the light of day. Iconic datasets include the LAION aesthetic dataset, Artemis, ImageNet, or Common Objects in Context (COCO). These collections of images, mostly sourced from the internet, reach dizzying scales of billions of image-text pairs.
Dataset creation implies a specific curatorial practice that involves a wide range of skills including a deep knowledge of the relevant visual domain, the ability to find the best exemplars, many practical skills such as scraping, image filtering, cleaning and cropping, and mastering the art of a coherent classification and annotation.
If large models such as Stable Diffusion require large scale datasets, various components such as LoRAs, VAEs, refiners, or upscalers can be trained with a much smaller amount of data. In practice, each dataset reflects a particular aspect of visual culture and acts as a conductor for imaginaries and world views.
Behind each dataset there is an organisation - of people, communities, corporate structures, public institutions, researchers, or others.
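As an illustration of what such curatorial work produces, here is a sketch of a single record in a LAION-style image-text dataset; the field names are loosely modelled on public LAION releases and are only indicative.

    # Illustrative only: one row in a LAION-style image-text dataset.
    # Field names are loosely modelled on public LAION releases and vary in practice.
    record = {
        "url": "https://example.org/images/harbour.jpg",   # where the image was scraped from
        "text": "Oil painting of a harbour at dusk",        # the alt text reused as a caption
        "width": 1024,
        "height": 768,
        "similarity": 0.31,   # CLIP image-text similarity, used for filtering
        "aesthetic": 6.2,     # predicted aesthetic score, as in LAION-Aesthetics
    }

    # A dataset is then simply millions or billions of such rows, usually stored
    # as tables of URLs and captions rather than as the images themselves.
    print(record["text"])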
Diffusion [1,346 characters]
A4 text:
Rather than a mere scientific object, diffusion can be treated as a network of meanings that binds together a technique from physics (diffusion), an algorithm for image generation, a model (Stable Diffusion), an operative metaphor relevant to cultural analysis and, by extension, a company (Stability AI) whose foundation has roots in hedge fund investment.
In physics, the diffusion phenomenon describes the movement of particles from an area of higher concentration to an area of lower concentration until an equilibrium is reached. To understand how this idea has been translated into image generation, imagine an image and gradually apply noise to it until it becomes totally noisy; then train an algorithm to 'learn' all the steps that have been applied to the image and ask it to apply them in reverse to recover the image. The algorithm detects tendencies in the noise and gradually follows and amplifies them in order to arrive at a point where an image emerges.
When the algorithm is trained with billions of examples, it becomes able to generate an image from any arbitrary noisy image. And the most remarkable aspect of this process is that the algorithm is able to de-noise images that it never ‘saw’.
The diffusion process can be guided by text prompts and images. In this sense, diffusion is stabilised through different forms of guidance.
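The forward, 'noising' half of this process can be sketched in a few lines of Python. This is a deliberately schematic toy, not the actual Stable Diffusion training code; the step count and the blending formula are simplifications.

    import torch

    def add_noise(image, t, num_steps=1000):
        """Blend an image with Gaussian noise; at t = num_steps only noise remains."""
        alpha = 1.0 - t / num_steps          # how much of the original image survives
        noise = torch.randn_like(image)
        return alpha * image + (1 - alpha) * noise

    image = torch.rand(3, 64, 64)            # stand-in for a training image
    for t in (0, 250, 500, 1000):
        noisy = add_noise(image, t)
        print(t, round(float(noisy.std()), 2))

    # Training then consists of showing the model such noisy images and asking it
    # to predict the noise that was added, so that at generation time it can
    # remove noise step by step, starting from a purely random image.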
GPU [1,028 characters]
A4 text:
The Graphics Processing Unit (GPU) is an electronic circuit for processing computer graphics. It was designed to add graphical interfaces to computing devices and expand the possibilities for interaction. While commonly used for videogames or 3D computer graphics, the GPU's particular way of computing has made it an object of interest and necessity for the cryptocurrency rush and, more recently, for the training and use of generative AI. As one of the more material elements of our map, it is often invisible hardware, but it plays a crucial role in translating between visual/textual human understanding and computer models for prediction (for example, between existing images and the 'calculation' of new ones).
The GPU is both an object that sits next to a desktop computer and an object that populates the massive cloud data centres racing towards the latest flagship large language model. In a way, it is a domestic object, used by enthusiasts to play with and modify the landscape of AI, and the holy grail of the big tech AI industry.
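A small sketch of the GPU as a desktop object, assuming a local PyTorch installation: the library exposes the card that will do the latent-space arithmetic, and the amount of VRAM decides what an enthusiast can run or train at home.

    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(props.name, round(props.total_memory / 2**30, 1), "GiB of VRAM")
    else:
        print("No CUDA GPU found; generation falls back to the (much slower) CPU.")

    # Moving a Stable Diffusion pipeline onto the card is then a single call,
    # e.g. pipe.to("cuda"), after which all de-noising steps run on the GPU.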
A3 text:
Body texts:
Well known among gamers, GPUs are graphics cards that allow for the fast computation of cryptocurrencies and now synthetic images.
Do graphics processing units have politics?
The GPU is a translation piece: most of the calculations to and from 'human-readable' objects, like a painting, a photograph or a piece of text in pixel space, into an n-number of matrices, vectors, and coordinates, or latent space, are made possible by the GPU. It is a transition object, a door into and out of a grey-boxed space.
Gefion – Denmark's new AI supercomputer, launched in October 2024 – is powered by 1,528 GPUs developed by Nvidia, arguably paving the way for Denmark's AI sovereignty.
The canonical neural network 'AlexNet' was trained using 2 GPUs in 2012. It inaugurated the mainstream use of GPUs for machine learning. Some years later, around 10,000 cards were used to train GPT-3 on a dataset of hundreds of billions of words.
Their life cycle and supply chain locate GPUs within a material network with the same issues as other chips and circuits: conflict minerals, labour rights, by-products of manufacturing and distribution, and waste. When training or generating AI responses, the GPU is the component that consumes the most energy relative to other computational components.
Managing a GPU-centred hardware infrastructure at large scale is a highly sought-after expertise. It gives a company such as Hugging Face a central role in the genAI ecosystem.
GPUs can also be shared, as in Stable Horde. The card on the right was lent to different users to create their own images. By sharing GPUs, capabilities can also scale. But not all scales are the same.
Bottom text:
Same as A4.
Hugging Face (1,192 characters)
A4 text
Hugging Face started out in 2016 as a chatbot for teenagers, but is now the collaborative hub for AI development – for image creation, speech synthesis, text-to-video, image-to-video, and much more. As a platform, it serves as an infrastructure for experimentation with advanced configurations that the conventional platforms do not offer, and is used by amateurs and professionals alike.
By making AI models and datasets publicly available, it can also be understood as an attempt to democratise AI and delink from the mainstream platforms. Yet, at the same time, Hugging Face is deeply intertwined with commercial interests. It collaborates with Meta and Amazon Web Services, which want to take advantage of the company's expertise in handling AI models at large scale, and it has received investments from Amazon, Google, Intel, IBM, NVIDIA and others within the AI industry.
Hugging Face is a key example of how generative AI – also when seeking autonomy – depends on a specialised technical infrastructure. In a constantly evolving field, reliability, security, scalability and adaptability become important parameters, and Hugging Face offers this in the form of a platform.
A3 texts:
Body texts
Hugging Face is a collaborative hub for AI development
From friendly chatbot to AI linchpin
Coordinating high-tech companies and myriad communities, Hugging Face is hailed as an attempt to democratise AI. At the same time, it is deeply intertwined with the financial dark matter of venture capital. It is therefore suspended between more autonomous and peer-based communities of practice, and a need for more 'client-server' relations in model training, which generally depends on 'heavy' resources (stacks of GPUs) and specialised expertise.
It gives communities access to resources and infrastructures and it gives companies the means to extract knowledge from the communities and buy themselves back from these attachments.
Bottom text
Hugging Face started out in 2016 as a chatbot for teenagers, but is now a (if not the) collaborative hub for AI development – not specifically targeted at AI image creation, but at generative AI more broadly (including speech synthesis, text-to-video, image-to-video, image-to-3D, and much more). It attracts amateur developers who use the platform to experiment with AI models, as well as professionals who typically use the platform as an outset for entrepreneurship. By making AI models and datasets available, it can also be labelled as an attempt to democratise AI and delink from the key commercial platforms, yet at the same time Hugging Face is deeply intertwined with these companies and various commercial interests.
Hugging Face has (as of 2023) an estimated market value of $4.5 billion. It has received high amounts of venture capital from Amazon, Google, Intel, IBM, NVIDIA and other key corporations in AI and, because of the company's expertise in handling AI models at large scale, it also collaborates with both Meta and Amazon Web Services. Yet, at the same time, it remains a platform for amateur developers and entrepreneurs who use the platform as an infrastructure for experimentation with advanced configurations that the conventional platforms do not offer.
Interfaces (1,286 characters)
A4 text:
Many users are familiar with mainstream corporate interfaces, such as Microsoft Image Creator or OpenAI's DALL-E. They are typically cloud based and easy to use – often just with a simple ‘prompt.’ Interfaces to autonomous AI differ significantly from this:
- There are many options depending on the configuration of one's own computer. Draw Things is a graphical interface suitable for macOS users; ComfyUI works on all systems; ArtBot has an advanced web interface and also integrates with Stable Horde (a peer-based infrastructure of GPUs).
- There are plenty of parameters to experiment with. Users can prompt, but also define negative prompts (what not to include in the image), combine different models (including one's own), choose the size of the image, and more.
- They allow users to generate their own models based on the models of Stable Diffusion, aka LoRAs. LoRAs are used to make images in, say, a particular manga style, and are shared on dedicated platforms.
- They can be integrated into one's own applications. The developer platform Hugging Face, for instance, releases models with different licences, for integration into new tools and services.
In short, they demand not only insights into how models work, but also deep knowledge of visual culture and aesthetics.
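As an illustration of the parameters such 'expert' interfaces expose, here is a minimal sketch using the diffusers library rather than a graphical front end. The checkpoint name is just one publicly released Stable Diffusion model; local interfaces let you point to your own downloaded models instead.

    import torch
    from diffusers import StableDiffusionPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

    image = pipe(
        prompt="a hand-drawn map of a small harbour town, ink and watercolour",
        negative_prompt="blurry, low resolution, text, watermark",  # what to keep out of the image
        width=512,
        height=768,
        num_inference_steps=30,   # how many de-noising steps to run
        guidance_scale=7.5,       # how strictly to follow the prompt
        generator=torch.Generator(device).manual_seed(42),  # reproducible output
    ).images[0]
    image.save("harbour.png")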
LAION [1,328 characters]
A4 text:
If our tour of autonomous AI imaging leads to well-funded companies such as Hugging Face or CivitAI, it also leads to the opposite end of the financial spectrum. The Large-scale Artificial Intelligence Open Network (LAION) fits in this category. It is a non-profit organisation operating on small donations. Its ambition is to democratise AI by reusing existing datasets and models, making them publicly available for use and education. LAION is the backbone of many image diffusion models.
To train a model, images need to be paired with descriptions. For smaller datasets, this is done manually – crowdsourced or by enthusiasts. LAION instead used an index of webpages compiled by the non-profit Common Crawl to find HTML documents with <img> tags and extract their alt text (a descriptive text included in the web page to increase accessibility for the visually impaired).
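The harvesting step described above can be sketched as follows; this is a deliberately simplified illustration, whereas the real LAION pipeline works on Common Crawl dumps at a vastly larger scale and adds further filtering steps such as CLIP-based similarity checks.

    from bs4 import BeautifulSoup

    def image_text_pairs(html, page_url):
        """Yield candidate image-text pairs: every <img> that carries alt text."""
        soup = BeautifulSoup(html, "html.parser")
        for img in soup.find_all("img"):
            alt = (img.get("alt") or "").strip()
            src = img.get("src")
            if alt and src:                      # keep only images that come with a caption
                yield {"url": src, "text": alt, "source_page": page_url}

    html = '<img src="/cat.jpg" alt="A cat sleeping on a radiator">'
    print(list(image_text_pairs(html, "https://example.org")))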
Here the non-profit status of the organisation plays an ambiguous role. LAION benefits from the exception for scientific research granted by the EU's text and data mining rules and has a legal right to build datasets and models on other people's images and texts, including those of artists. If this is true for LAION itself, can the same be said for the commercial parties interested in the models?
Latent space
A4 text - Same as A3 bottom or this shorter version [1,355 characters]:
Latent space is a highly abstract space consisting of compressed representations of images and texts. A key object is the Variational Autoencoder (VAE), which makes images and texts available to different kinds of operations – whose results are then decoded back into images. An important operation happening in the latent space is the training of an algorithm. In diffusion-based algorithms, the algorithm is trained by learning to apply noise to an image and then reconstruct it from complete or random noise (this process is discussed more in depth in the entry on diffusion).
To continue our mapping, it is important to note that the latent space is nurtured by various sources. Many of the datasets that are used to train models are made by 'scraping' the internet, while others are built on repositories like Instagram, Flickr, or Getty Images. When it comes to autonomous AI imaging, there is typically an organisation and a community behind each dataset and training. LAION (Large-scale Artificial Intelligence Open Network) is a non-profit organisation that develops and offers free models and datasets. Stable Diffusion was trained on datasets created by LAION, using Common Crawl (another non-profit organisation that has built a repository of 250 billion web pages) and CLIP (OpenAI's neural network which learns visual concepts from natural language supervision) to compile an extensive record of links to images with 'alt text' (a descriptive text for non-text content, created by 'web masters' for increased accessibility) – a useful set of annotated images, to be used for model training. We begin to see that a model's dependencies have large organisational, social and technical ramifications.
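To make the crossing between pixel space and latent space tangible, here is a minimal sketch using the VAE that ships with a public Stable Diffusion v1 checkpoint; the model name is an example, the input is a stand-in image, and the scaling factor applied before diffusion is omitted.

    import torch
    from diffusers import AutoencoderKL

    # Load only the VAE component from a public Stable Diffusion v1 checkpoint.
    vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

    pixels = torch.rand(1, 3, 512, 512) * 2 - 1      # stand-in image, scaled to [-1, 1]
    with torch.no_grad():
        latents = vae.encode(pixels).latent_dist.sample()
        decoded = vae.decode(latents).sample

    print(pixels.shape)    # torch.Size([1, 3, 512, 512]) – pixel space
    print(latents.shape)   # torch.Size([1, 4, 64, 64])   – the compressed 'latents'
    print(decoded.shape)   # back to pixel space after decoding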
A3 poster
Body text:
Technically, ‘diffusion’ is a computational process that involves iteratively removing ‘noise’ from an image, a series of mathematical procedures that leads to the production of another image.
Bottom text
Latent space is a highly abstract space consisting of compressed representations of images and texts. A key object is the Variational Autoencoder (VAE), which makes images and texts available to different kinds of operations – whose results are then decoded back into images. An important operation happening in the latent space is the training of an algorithm. In diffusion-based algorithms, the algorithm is trained by learning to apply noise to an image and then reconstruct it from complete or random noise.
To continue our mapping, it is important to note that the latent space is nurtured by various sources. Many of the datasets that are used to train models are made by 'scraping' the internet, while others are built on repositories like Instagram, Flickr, or Getty Images. When it comes to autonomous AI imaging, there is typically an organisation and a community behind each dataset and training. LAION (Large-scale Artificial Intelligence Open Network) is a non-profit organisation that develops and offers free models and datasets. Stable Diffusion was trained on datasets created by LAION, using Common Crawl (another non-profit organisation that has built a repository of 250 billion web pages) and CLIP (OpenAI's neural network which learns visual concepts from natural language supervision) to compile an extensive record of links to images with 'alt text' (a descriptive text for non-text content, created by 'web masters' for increased accessibility) – a useful set of annotated images, to be used for model training. We begin to see that a model's dependencies have large organisational, social and technical ramifications.
BOTTOM TEXT ALTERNATIVE [1,355 characters]:
Latent space is where diffusion-based algorithms are trained and images generated – by applying noise to an image and then reconstructing an image from complete or random noise (aka diffusion). These operations are not made directly on pixels, but on lighter statistical representations called ‘latents’.
It is important to note that the latent space is nurtured by various sources. There are models used to bridge prompts and the latent space (such as CLIP or the VAE), but models are also used in a pipeline. Some get rid of too many fingers, others add visual styles – each acting as a conductor for imaginaries and world views, each with its own organisation and community.
Stable Diffusion was trained on datasets created by LAION (Large-scale Artificial Intelligence Open Network) – a non-profit organisation that develops and offers free models and datasets. They used Common Crawl (an open repository of 250 billion web pages) and CLIP (OpenAI's neural network which learns visual concepts from natural language supervision) to compile an extensive record of links to images with 'alt text' (a descriptive text for non-text content, created for increased accessibility) – that is, a useful set of annotated images, to be used for model training. We begin to see that a model's dependencies have large organisational, social and technical ramifications.
LoRA [1,272 characters]
A4 text: Large image models can be used to generate images in many styles. But they may show limitations when a user wants a specific output, such as a particular genre of manga, or when an improvement is needed for some details, such as specific hand positions or legible text. This is where LoRAs come in. A LoRA is a smaller model created with a technique that makes it possible to improve the performance of a base model on a given task. The process of LoRA training is very similar to training a model, but at a different scale. Amateurs and advanced users who risk themselves in this adventure can count on various channels of support, from Discord servers to dedicated platforms.
The existence of LoRAs evokes the possibility of a re-appropriation of the model via fine-tuning. Even if not complete, this helps users regain some form of autonomy from large model providers, in particular because their needs are defined bottom-up. The largest gain is in terms of literacy and understanding of the training process more generally. Indeed, as LoRAs are miniature models, the skills and expertise related to curation, sourcing, annotation, and model semantics are being developed through a peer-to-peer effort in communities of amateurs and image makers.
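A minimal sketch of how a downloaded LoRA is applied on top of a base model with the diffusers library; the checkpoint is one public Stable Diffusion release, and the LoRA folder and file name are placeholders for a file obtained from, say, CivitAI or the Hugging Face Hub.

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    # Placeholder folder and file: a LoRA downloaded from CivitAI or the Hub.
    pipe.load_lora_weights("path/to/lora_folder", weight_name="my_manga_style_lora.safetensors")

    # The base model still does most of the work; the LoRA only nudges it
    # towards the style or subject it was fine-tuned on.
    image = pipe("a rainy street corner, manga style").images[0]
    image.save("street.png")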
A3 text
Body text:
A LoRA is a smaller model created with a technique that makes it possible to fine-tune a base model to meet specific needs.
"In case of V6 this number was ~20k manually labeled images. Now, we need someone who can look at images and use their art critique skills to judged on the scale we invented. And who is that impartial person, unbiased and neutral, able to make decisions or judgments based on objective criteria rather than personal feelings, interests, or prejudices? It's me, obviously. So, after spending weeks in data labeling cave methodically ranking each image I was able to generate our aesthetic dataset large enough to be useful."
Prompting a model with the query "The artist Louise Bourgeois working in her studio" resulted in images of an older woman dressed in contemporary attire with a vague physical likeness to the artist. After assembling a dataset from online and offline images, we trained a LoRA locally. The facial expression with the ironic smile characterising Bourgeois was now present. The general texture of the artworks surrounding Bourgeois was also closer to her work, although they remained rather academic in style. Instead of showing her drawing at the studio's table, the model was now showing her in contact with the sculptures.
The CivitAI user BigHeadTF has selected a few pictures he created with his LoRA. Hulk is depicted as cajoling a teddy bear or crossdressing as Shrek's Princess Fiona. The images play with the contrast between Hulk's overblown virility and childlike or female connotations. The images demonstrate the model's ability to expand the hero's universe into other registers or fictional worlds. "The Incredible Hulk (2008)" doesn't just faithfully reproduce existing images of Hulk, it also opens new avenues for creation and combinations for the green hero.
Bottom text:
Large image models can be used to generate images in many styles. But they may show limitations when a user wants a specific output, such as a particular genre of manga, or when an improvement is needed for some details, such as specific hand positions or legible text. This is where LoRAs come in. A LoRA is a smaller model created with a technique that makes it possible to improve the performance of a base model on a given task. The process of LoRA training is very similar to training a model, but at a different scale. Amateurs and advanced users who risk themselves in this adventure can count on various channels of support, from Discord servers to dedicated platforms.
The existence of LoRAs evokes the possibility of a re-appropriation of the model via fine-tuning. Even if not complete, this helps users regain some form of autonomy from large model providers, in particular because their needs are defined bottom-up. The largest gain is in terms of literacy and understanding of the training process more generally. Indeed, as LoRAs are miniature models, the skills and expertise related to curation, sourcing, annotation, and model semantics are being developed through a peer-to-peer effort in communities of amateurs and image makers.
Maps [1,322 characters]
A4 text:
There is little knowledge of what AI really looks like. Perhaps because of this lack of insight, there is an abundance of maps – corporate landscapes, organisational diagrams, technical workflows, etc. – used to design, navigate, or criticise AI’s being in the world. If one does not confuse the map with the territory (reality with its representation), one begins to see how externalising abstractions of AI by way of cartography takes part in making AI a reality.
The maps presented here attempt to abstract the different objects that one comes across when entering the world of autonomous AI imaging. They can serve as a practical guide to understand what the objects of this world are called, how they connect to each other, to communities or to underlying infrastructures – perhaps also as an outset for one's own abstractions.
A distinction between what you see and what you do not see can be useful. Latent space refers to the invisible space that exists between the capture of images in datasets and the generation of new images. Pixel space is where one encounters objects of visual culture – such as the interfaces or generated images. But models, datasets, interfaces, images and other objects also exist in other planes of abstraction – for instance, of their material infrastructure or their organisation on platforms.
On the A3 poster (everything is in the main body):
There is little knowledge of what AI really looks like. The maps presented here are an attempt to abstract the different objects that one may come across when entering the world of autonomous and decentralised AI image creation. They can serve as a useful guide to experience what the objects of this world are called, how they connect to each other, to communities or to underlying infrastructures – perhaps also as an outset for one's own exploration. A distinction between 'pixel space' and 'latent space' can be helpful – that is, distinguishing what you see from what you do not see.
Latent space refers to the invisible space that exists between the capture of images in datasets and the generation of new images. Images are encoded with 'noise', and the machine then learns how to decode them back into images (aka 'image diffusion'). Contrary to common belief, there is not just one dataset used to make image generation work, but multiple models and datasets to 'upscale' images of low resolution, 'refine' the details in the image, and much more. Behind every model and dataset there is a community and an organisation.
Pixel space is where one encounters objects of visual culture. Large-scale datasets are for instance compiled by crawling and scraping repositories of visual culture, such as museum collections. Whereas conventional interfaces for generating images only offer the possibility to 'prompt', interfaces to Stable Diffusion offer advanced parameters, as well as options to train one's own models, aka LoRAs. This demands technical insights into latent space as well as aesthetic/cultural understandings of visual culture (say, of manga, gaming or art).
Both images and LoRAs are organised and shared on dedicated platforms (e.g., Danbooru or CivitAI). The generation of images and use of GPU/hardware can also be distributed to a community of users in a peer-to-peer network (Stable Horde). This points to how models, software, datasets and other objects always also exist suspended between different planes of dependencies - organisational, material, or other.
Model card [1,308 characters]
A4 text:
As models begin to pile up in open repositories like Hugging Face, model cards have emerged as a means to document them. Think about model cards as nutrition labels for models. Ideally, they list the model's ingredients, how it was trained and its validation procedures as well as its intended use and limitations.
Whilst code repositories cannot force their documentation protocol upon the users, they automatically create an empty model card when a new model is uploaded, in an effort to encourage standardisation and transparency. Sometimes model creators thoroughly document the model with a reference to an academic paper; sometimes they offer only minimal information or simply leave the model card empty. In this, model cards testify to the diverse nature of model providers. Some are working in computer science labs or in companies, others are amateurs with little time or patience left for this tedious work, or simply have no desire to share.
Further, developers may find it more appealing to document their models in other forms. In CivitAI, a platform where manga fans share their models (or ‘LoRAs’), each model is introduced with a succinct description written in a more affective tone, where the authors explain their goal, crack a joke, beg for a tip on their Patreon and thank their network of collaborators as well as the models and resources they are building on.
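For readers who want to see what a card looks like in practice, here is a small sketch of reading one programmatically with the huggingface_hub library; on the Hub, a model card is simply the README.md of a model repository, with structured YAML metadata at the top. The repository name is an example.

    from huggingface_hub import ModelCard

    # Fetch the card of a public model repository (example repository name).
    card = ModelCard.load("CompVis/stable-diffusion-v1-4")

    print(card.data)         # the YAML metadata: license, tags, datasets, ...
    print(card.text[:500])   # the free-form documentation that follows the metadata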
Object of Interest/Necessity [1,221 characters]
A4 text:
With the notion of an ‘object of interest’ a guided tour of a place, a museum or collection, likely comes to mind. One may easily read this compilation of texts as a catalogue for such a tour in a social and technical system, where we stop and wonder about the different objects that, in one way or the other, take part in the generation of images with Stable Diffusion.
Perhaps 'a guided tour' also limits the understanding of what objects of interest are? Take for instance the famous Kepler telescope, whose mission was to search the Milky Way for exoplanets. Among all the stars, there are candidates for this, or so-called Kepler Objects of Interest (KOI).
What makes AI livable? What are the underlying dependencies on relations between communities, models, capital, technical units, and more in these technical objects of interest?
Objects also contain an associative power that literally can create memories and make a story come alive. These texts are therefore not just a collection of the objects that make generative AI images, but an exploration of an imaginary of AI image creation through the collection and exhibition of objects – and in particular, an imaginary of ‘autonomy’ from mainstream capital platforms.
On the A3 poster (everything is in the main body):
Most people’s experiences with generative AI image creation come from platforms like OpenAI’s DALL-E or other services. Nevertheless, there are also communities who for different reasons seek some kind of independence and autonomy from the mainstream platforms. The outset for this catalogue is ‘Stable Diffusion’, a so-called Free and Open Source Software system for AI image creation.
With the notion of an ‘object of interest’ a guided tour of a place, a museum or collection, likely comes to mind. One may easily read this compilation of texts as a catalogue for such a tour in a social and technical system, where we stop and wonder about the different objects that, in one way or the other, take part in the generation of images with Stable Diffusion.
'A guided tour' perhaps also limits the understanding of what objects of interest are? In science, for instance, an object of interest sometimes refers to what one might call the potentiality of an object. Take for instance the famous Kepler telescope, whose mission was to search the Milky Way for exoplanets (planets outside our own solar system). Among all the stars, there are candidates for this, or so-called Kepler Objects of Interest (KOI).
In similar ways, this catalogue is the outcome of an investigative process where we – by trying out different software, reading documentation and research, looking into communities of practice that experiment with AI image creation, and more – have sought to understand the things that make generative AI images with Stable Diffusion possible. We have tried to describe not only the objects, but also their underlying dependencies on relations between communities, models, capital, technical units, and more.
Objects, however, also contain an associative power that literally can create memories and make a story come alive. This catalogue is therefore not just a collection of the objects that make generative AI images, but an exploration of an imaginary of AI image creation through the collection and exhibition of objects – and in particular, an imaginary of ‘autonomy’ from mainstream capital platforms.
Pixel space [1,615 characters]
A4 text (copied from card text)
In pixel space, you find a range of visible objects that a typical user would normally meet. This includes the interfaces for creating images. In conventional interfaces like DALL-E or Bing Image Creator, users prompt in order to generate images. What is particular for autonomous and decentralised AI image generation is that the interfaces have many more parameters and ways to interact with the models that generate the images. They function more like 'expert' interfaces.
In pixel space one finds many objects of visual culture. Apart from the interface itself, this includes both all the images generated by AI and all the images used to train the models behind them. These images are, as described above, used to create datasets, compiled by crawling the internet and scraping images that all belong to different visual cultures – ranging, e.g., from museum collections of paintings to criminal records with mug shots.
Many users also have specific aesthetic requirements for the images they want to generate – say, to generate images in a particular manga style or setting. The expert interfaces therefore also contain the possibility to combine different models and even to post-train one's own models, also known as a LoRA (Low-Rank Adaptation). When sharing the images on platforms like Danbooru (one of the first and largest image boards for manga and anime), images are typically well categorised – both descriptively ('tight boots', 'open mouth', 'red earrings', etc.) and according to visual cultural style ('genshin impact', 'honkai', 'kancolle', etc.). Therefore they can also be used to train more models.
A3 poster
Bottom text
In pixel space, you find a range of visible objects that a typical user would normally meet. This includes the interfaces for creating images. In conventional interfaces like DALL-E or Bing Image Creator, users prompt in order to generate images. What is particular for autonomous and decentralised AI image generation is that the interfaces have many more parameters and ways to interact with the models that generate the images. They function more like 'expert' interfaces.
In pixel space one finds many objects of visual culture. Apart from the interface itself, this includes both all the images generated by AI and all the images used to train the models behind them. These images are, as described above, used to create datasets, compiled by crawling the internet and scraping images that all belong to different visual cultures – ranging, e.g., from museum collections of paintings to criminal records with mug shots.
Many users also have specific aesthetic requirements for the images they want to generate – say, to generate images in a particular manga style or setting. The expert interfaces therefore also contain the possibility to combine different models and even to post-train one's own models, also known as a LoRA (Low-Rank Adaptation). When sharing the images on platforms like Danbooru (one of the first and largest image boards for manga and anime), images are typically well categorised – both descriptively ('tight boots', 'open mouth', 'red earrings', etc.) and according to visual cultural style ('genshin impact', 'honkai', 'kancolle', etc.). Therefore they can also be used to train more models.
Prompt [1,678 characters]
A4 text: In theory, a prompt is meant to be simple. You supply the instruction and the model does the rest. In practice, however, the story is a little different. By interpreting the prompt, the system supposedly 'reads' what is in the user's mind. Interpreting the prompt involves much more than a literal translation of a string of words into pixels. It is the interpretation of the meaning of these words, with all the cultural complexity this entails. As, historically, prompts were limited in size, this work of interpretation was performed on the basis of a very minimal description, often with a syntax reduced to a comma-separated list or a string of tags. Even now, with extended descriptions, the model is still tasked to fill in the blanks. As the model tries to make do, it inevitably reveals its own bias. Through prompting, the user gradually develops a feel for the model's singularity. They elaborate semantics to work around perceived limitations and devise targeted keywords to which a particular model responds.
Prompts are rarely interpreted directly by the model. They go through a series of checks before being rendered. They are treated as sensitive and potentially offensive. This has motivated different forms of censorship by mainstream platforms, and in return it has propelled the development of many strategies aimed at gaming the model. One strong motivation to adopt an autonomous infrastructure is avoiding censorship. Even if models are carefully trained to stay in the fold, working with a local model allows users to do away with many layers of platform censorship. For better or for worse, prompting the model locally means deciding individually what to censor.
A3 poster texts
Body texts
A prompt is a string of words meant to guide an image generator in the creation of an image.
The early 'avocado armchair' prompt demonstrated that the model was able to produce something that didn't exist in its training set.
Instead of fixing the model's bias, DALL-E used to append words to the user's prompt to orient the result towards a more "diverse" output.
Resolution is a marker of quality
Style references reflecting the cultural embedding of the model
Reference to the source of the model's dataset
Early models reacted better to lists of tags; they evolved towards natural language under the influence of LLMs.
8k resolution, beautiful, cozy, inviting, bloomcore, decopunk, opulent, hobbit-house, luxurious, enchanted library in giverny flower garden, lily pond, detailed painting, romanticism, warm colors, digital illustration, polished, psychadelic, matte painting trending on artstation
Even if models are carefully trained to stay in the fold, working with a local model allows users to do away with many layers of platform censorship. For better or for worse, prompting the model locally means deciding individually what to censor.
Our experiment, reflexive prompt
Two images generated with the model EpicRealism, with a prompt which inverts the traditional gender roles and asks for a picture of a man washing the dishes. The surreal results testify to the degree to which the model internalises the division of labour in the household.
Bottom text
Stable Horde [1,299 characters]
A4 text
Horde AI or Stable Horde is a distributed cluster of GPUs. The project describes itself as a "volunteer crowd-sourced distributed cluster of image and text generation workers". This translates as a network of individual GPU users that "lend" their devices and store large language models. This means that one can generate an image from any device connected to this network through an interface, e.g. a website on a phone. While the visible effects are the same as using ChatGPT, Copilot or any other proprietary service, the images in this network are "community" generated. The request is not sent to a server farm or a company, but to a user that is willing to share their GPU power and stored models. Haidra, the non-profit associated with HordeAI, seeks to make AI free, open-source, and collaborative, effectively circumventing the reliance on AI big-tech players.
Projects like Stable Horde/HordeAI offer a glimpse into the possibilities of autonomy in the world of image generation, and offer other ways of volunteering through technical means. In a way, this project inherits some of the ethos of P2P sharing and recursive publics, yet updated for the world of LLMs. The GPU used in this project is (intermittently) part of the HordeAI network, generating and using the kudos currency.
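To give a sense of how a request travels through the horde rather than to a corporate server, here is a hedged sketch of queuing a generation with the public AI Horde web API. The endpoint, header and field names follow the project's v2 API documentation at the time of writing and should be checked against the current docs; the anonymous key gives the lowest queue priority.

    import requests

    response = requests.post(
        "https://stablehorde.net/api/v2/generate/async",
        headers={"apikey": "0000000000"},    # the shared anonymous key
        json={
            "prompt": "a lighthouse in a storm, oil painting",
            "params": {"width": 512, "height": 512},
        },
        timeout=30,
    )
    # The request is queued and picked up by a volunteer worker; the returned id
    # is then polled until the finished image can be fetched.
    print(response.json())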
A3 poster
Body texts:
Stable Horde is based on a form of technical collaboration with its own model of exchange.
Kudos, the platform's currency, functions as a barter system, where the main material is GPU processing power.
When a user lends their machine to the project, it becomes a worker. Its graphics card can be called to generate images for other users.
Each worker decides which model they run and what kind of content they allow. For instance, a worker can refuse to generate NSFW images.
Well known among gamers, GPUs are graphics cards that allow for the fast computation of cryptocurrencies and now synthetic images.
The project is based on the values of mutual help and inspired by anarchist principles. It is an effort to delink from mainstream platforms and explore new forms of solidarity.
AI Horde is a project that evolved from groups interested in role-playing. The name 'horde' is a reference to the Kobold monster from the game Advanced Dungeons & Dragons.
Variational Autoencoder, VAE [1,333 characters]
A4 text:
The variational autoencoder (VAE) belongs to the nitty-gritty part of latent space.
When a user selects an image and a prompt as an input, they are not sent directly to the diffusion algorithm. They are first encoded into meaningful variables. The encoding of text is often performed by an encoder named CLIP, and the encoding of images is carried out by a variational autoencoder (VAE).
These operations are not made directly on pixels, but on lighter statistical representations called ‘latents’. To turn latents back into images, the process leads through the VAE once again. This time the VAE acts as a decoder and is responsible for translating the result of the diffusion process, the latents, back into pixels. Encoding into latents and decoding back into pixels, VAEs are bridges between spaces, between pixel and latent spaces.
On platforms such as CivitAI, users train and share VAEs. As in the case of LoRAs, VAEs are components that users can act upon in order to improve the behaviour of a model, make it fit their needs and increase the value of their creations. The active exchange of VAEs on genAI platforms testifies to the flexibility of the image generation pipeline when it is open sourced. Their constant refinement also testifies to the platforms' success in turning the enthusiasm of amateurs into technical expertise.
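As a closing illustration, here is a hedged sketch of how a community-shared VAE is swapped into an existing pipeline with the diffusers library. The fine-tuned VAE named below is one publicly shared example; VAEs downloaded from CivitAI can be loaded in the same way, and the base checkpoint is again just an example.

    from diffusers import AutoencoderKL, StableDiffusionPipeline

    # A publicly shared, fine-tuned VAE (example); files from CivitAI work similarly.
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        vae=vae,   # replace the checkpoint's own encoder/decoder
    )

    # Everything else in the pipeline stays the same; only the bridge between
    # latents and pixels has been exchanged, which is often enough to sharpen
    # faces and fine texture in the decoded images.
    image = pipe("portrait photograph, soft window light").images[0]
    image.save("portrait.png")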