Revision as of 16:00, 25 August 2025
Diffusion
Rather than a mere scientific object, diffusion is treated here as a network of meanings that binds together a technique from physics (diffusion), an algorithm for image generation, a model (Stable Diffusion), an operative metaphor relevant to cultural analysis and by extension a company (Stability AI) and its founder with roots in hedge fund investment.
In her text "Diffused Seeing", Joanna Zylinska aptly captures the multivalence of the term:
... the incorporation of ‘diffusion’ as both a technical and rhetorical device into many generative models is indicative of a wider tendency to build permeability and instability not only into those models’ technical infrastructures but also into our wider data and image ecologies. Technically, ‘diffusion’ is a computational process that involves iteratively removing ‘noise’ from an image, a series of mathematical procedures that leads to the production of another image. Rhetorically, ‘diffusion’ operates as a performative metaphor – one that frames and projects our understanding of generative models, their operations and their outputs.[1]
In complement to Zylinska's understanding of diffusion as a term operating at different levels with an emphasis on permeability, we inquire into the dialectical relation that opposes it to stability (as interestingly emphasized in the name Stable Diffusion), where the permeability and instability enclosed in the concept constantly motivates strategies of control, direction, capitalization or democratization that leverage the unstable character of diffusion dynamics.
What is the network that sustains this object?
From physics to AI, the diffusion algorithm
Our first move in this network of meanings is to follow the trajectory of the concept of diffusion from the 19th-century laboratory to the computer lab. Although diffusion had been studied since antiquity, Adolf Fick published the first laws of diffusion, based on his experimental work, in 1855. As Stanford AI researchers Russakovsky et al. put it:
In physics, the diffusion phenomenon describes the movement of particles from an area of higher concentration to a lower concentration area till an equilibrium is reached [1]. It represents a stochastic random walk of molecules.[2] MISSING REFERENCE
To understand how this idea has been translated into image generation, it is worth looking at the example given by Sohl-Dickstein and colleagues, who authored the seminal paper on diffusion in image generation.[3] The authors propose the following experiment: take an image and gradually apply noise to it until it becomes totally noisy; then train an algorithm to 'learn' all the steps that have been applied to the image and ask it to apply them in reverse to recover the image (see illustration). By introducing some movement in the image, the algorithm detects tendencies in the noise. It then gradually follows and amplifies these tendencies until it arrives at a point where an image emerges. When the algorithm is able to recreate the original image from the noisy picture, it is said to be able to de-noise. When the algorithm is trained with billions of examples, it becomes able to generate an image from any arbitrary noisy image. And the most remarkable aspect of this process is that the algorithm is able to generalise from its training data: it can de-noise images that it never “saw” during the training phase.
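The forward, noise-adding half of this experiment can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses the standard closed-form jump to step t found in later DDPM-style formulations of the same idea, a linear noise schedule, and a random 8×8 array as a stand-in for an image.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear schedule of per-step noise levels beta_t; alpha_bar
    tracks how much of the original signal survives at each step."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Jump straight to noising step t: mix the image with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

alpha_bar = noise_schedule()
x0 = rng.standard_normal((8, 8))               # stand-in for an image
x_early, _ = forward_diffuse(x0, 10, alpha_bar, rng)
x_late, _ = forward_diffuse(x0, 999, alpha_bar, rng)

# early on the image signal is almost intact...
print(np.corrcoef(x0.ravel(), x_early.ravel())[0, 1])  # close to 1
# ...by the last step it has effectively vanished
print(alpha_bar[999] < 1e-3)  # True
```

The reverse process, which the prose describes as "learning" the steps and applying them backwards, is where the trained network comes in: it estimates the noise `eps` so that `x0` can be recovered from `xt`.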
Another aspect of diffusion in physics that matters for image generation can be seen at the end of the definition of the concept as stated in Wikipedia (emphasis is ours):
diffusion is the movement of a substance from a region of high concentration to a region of low concentration without bulk motion.[4]
Diffusion doesn't capture the movement of a bounded entity (a bulk, a whole block of content); it is a mode of spreading that flexibly accommodates structure. "Diffusion is the gradual movement/dispersion of concentration within a body with no net movement of matter."[4] This characteristic makes it particularly apt at capturing multi-level relations between image parts without having to identify a source that constrains these relations. It gives it access to an implicit structure. Metaphorically, this can be compared to the process of looking for faces in clouds (or reading signs in tea leaves). We do not immediately see a face in a cumulus, but the faint movement of the mass stimulates our curiosity until we gradually delineate the nascent contours of a shape we can begin to identify.

Stabilising diffusion
Diffusion as presented by Sohl-Dickstein and colleagues is the basis of many current models for image generation. However, no user deals directly with diffusion as demonstrated in the paper. It is encapsulated in software, and a whole architecture mediates between the algorithm and its environment (see diagram of the process). For instance, Stable Diffusion is a model that encapsulates the diffusion algorithm and makes it tractable at scale. Rombach et al., the brains behind the Stable Diffusion model, popularized the diffusion technique by porting it to latent space.[6] Instead of working on pixels, the authors performed the computation on compressed vectors of data and managed to reduce the computational cost of training and inference. They thereby made the technique accessible to a larger community of developers, and also added important features to the process of image synthesis:
By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner.[7]

Diffusion can be guided by text prompts and other forms of conditioning inputs such as images, opening it up to multiple forms of manipulation and use, such as inpainting. This stabilises diffusion in the sense that it allows for different forms of control; the diffusion algorithm in itself doesn't contain any guidance. This is an important step in moving the algorithm outside the worlds of GitHub and tech tutorials into a domain where image makers can experiment with it. The pure algorithm cannot move alone.
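One widely used mechanism for prompt guidance, classifier-free guidance, gives a concrete sense of how conditioning steers denoising. The sketch below is a toy, not Stable Diffusion's actual code: the arrays merely stand in for a model's conditional and unconditional noise predictions at one denoising step.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Steer denoising toward the prompt by pushing the unconditional
    noise estimate in the direction of the conditional one.
    w = 0 ignores the prompt, w = 1 follows it exactly,
    w > 1 exaggerates prompt adherence."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# stand-ins for the model's two noise predictions at one step
eps_uncond = np.zeros(4)
eps_cond = np.ones(4)
print(classifier_free_guidance(eps_uncond, eps_cond, 7.5))  # [7.5 7.5 7.5 7.5]
```

The single scalar `w` (often exposed to users as a "guidance scale") is a small but telling example of stabilisation: a dial that converts the algorithm's indifference into controllable adherence to a prompt.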
But if diffusion is relatively stabilized in technical terms (input control and infrastructure), its adoption by increasingly large circles of users and developers has contributed to different forms of disruption, for better and for worse: parodies and deepfakes, political satire and revenge porn. Once in circulation, it moves both as a product and as images.
Rhetorically, it becomes a metaphor within a set of nested metaphors that include the brain as computer, concepts such as 'hallucinations' or deep 'dreams' that respond to a more general cultural condition. As Zylinska notes:
We could perhaps suggest that generative AI produces what could be called an ‘unstable’ or ‘wobbly’ understanding – and a related phenomenon of ‘shaky’ perception. Diffusion [...] can be seen as an imaging template for this model.
Still according to Zylinska, this metaphor posits instability as an organizing concept for the image more generally:
Indeed, it is not just the perception of images but their very constitution that is fundamentally unstable.[8]
As a concept, it is in line with a general condition of instability due to the extensive disruptions brought on by the flows of capital. The wobbly, risky, financial and logistical edifice that supports Stable Diffusion's development testifies to this. The company Stability AI, founded by former hedge fund manager Emad Mostaque, helped finance the transformation of the "technical" product into software available to users, powered by an expensive infrastructure, and to sell it as a service. To access large-scale computing facilities, Mostaque raised $100 million in venture capital.[9] His experience in the financial sector helped convince donors and secure the financial base. The investment was sufficient to give Stability a chance to enter the market. Moving from the computer lab to a working infrastructure required grounding the diffusion algorithm in another material environment comprising Amazon servers, the JUWELS Booster supercomputer, and tailor-made data centers around the world.[10] This scattered infrastructure corresponds to the global distribution of the company's legal structure: one leg in the UK and one leg in Delaware, the latter offering a welcoming tax environment for companies. Dense networks of investors and servers supplement code. In that perspective, the development of the Stable Diffusion algorithm is inseparable from risk investment. These risks take the form of a long string of controversies and lawsuits, especially for copyright infringement, and the eventual firing of Mostaque from his position as CEO after aggressive press campaigns against his management. Across all its dimensions, the shaky nature of this assemblage mirrors the physical phenomenon Stable Diffusion's models simulate.
Stabilising diffusion means attending to a huge range of problems happening simultaneously, problems that require extremely different skills and competences: identifying faulty GPUs, deciding on batch sizes in training, assessing the impact of different floating-point formats on training stability, securing investment and managing delays in payment, pushing back against legal actions, and aligning prompts and images.
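The mention of floating-point formats is not anecdotal. As a toy illustration (a bare running sum, not an actual training run), adding many small, gradient-sized updates in float16 can silently fail where float32 succeeds, because in half precision the updates are smaller than the rounding step of the accumulator:

```python
import numpy as np

updates = np.full(10_000, 1e-4)  # stand-ins for tiny gradient updates

# accumulate in half precision: each 1e-4 is rounded away against 1.0
acc16 = np.float16(1.0)
for u in updates:
    acc16 = np.float16(acc16 + np.float16(u))

# accumulate in single precision: the updates add up as expected
acc32 = np.float32(1.0)
for u in updates:
    acc32 = np.float32(acc32 + np.float32(u))

print(acc16)  # stays at 1.0: the updates vanished
print(acc32)  # close to 2.0: the updates accumulated
```

At the scale of billions of parameters, such silent losses are one of the mundane instabilities that "stabilising diffusion" has to manage.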
How does diffusion create value? Or decrease / affect value?
The question of value needs to be addressed at different levels, as we have chosen to treat diffusion as a complex of techniques, algorithms, software, metaphors and finance. How is value expressed in the different aspects of this constellation?
First, we can consider it as an object concretised in a material form. The model is at the core of a series of online platforms that monetize access to it. For a subscription fee, users can generate images. Its value stems from the model's ability to generate images in a given style (e.g., Midjourney), with good prompt adherence, etc. It is a familiar value form for models: AI as a service that generates revenue and capitalizes on the size of a userbase.
As the model is open source, it can also be shared and used in different ways. For instance, users can run the model locally without paying a fee to Stability AI. It can also be integrated into peer-to-peer systems of image generation such as Stable Horde, or shared installations through non-commercial APIs. In this case, the model gains value with adoption. And as interest grows, users start to build things with it, such as LoRAs, bespoke models, and other forms of conditioning. Through this burgeoning activity, the model's affordances grow. Its reputation increases as it enters different economies of attention where users gain visibility by tweaking it or generating 'great art'.
The model also gains value by comparison // SMTH SEEMS TO BE MISSING //. In parallel, in scientific circles, the model's value is measured by different metrics. Computer vision papers present comparative graphs: state of the art vs. our method. Value lies in the ability to do what others cannot do, or do less well: 'inversion' (the ability to flexibly transform an image attribute without unwanted changes[12]), multiple modalities, speed. // AGAIN, SMTH SEEMS TO BE MISSING // Authors such as Sohl-Dickstein or Rombach have gained reputation that can be evaluated through citation indices.
Diffusion decreases the value of the singular image and increases the value of the image ensemble. To learn how to generate images, algorithms such as Stable Diffusion or Imagen need to be fed with examples. These images are given to the algorithm one by one. Through its learning phase, the algorithm treats them as moments of an uninterrupted process of variation, not as singular specimens. At this level, the process of image generation is radically anti-representational. It treats the image as a mere moment (“quelconque”[1]), a variation among many. But the model gains singularity.
In the Stable Diffusion ecosystem, the ability to experiment is one of the highest values.[2] The dynamics instilled by the project are well captured by Patrick Esser, a lead researcher on diffusion algorithms, who defined his ideal contributor as someone who would “not overanalyze too much” and “just experiment” (Jennings 2022). The project's politics of openness was motivated by the realization that its ambitions exceeded the narrow goal of crafting a good product:
“It’s not that we're running out of ideas, we’re mostly running out of time to follow up on them all. By open sourcing our models, there's so many more people available to explore the space of possibilities.” (Jennings 2022)
What is its place/role in techno-cultural strategies?
As a concept that traverses multiple dimensions of culture and technology, diffusion raises questions about strategies operating on different planes. In that sense, it constitutes an interesting lens through which to discuss the question of the democratization of generative AI. As a premise, we adopt the view[3] that the relation between genAI and democracy cannot be reduced either to an apocalypse in which artificial intelligence signals the end of democracy, or to an inevitable movement towards a better optimized future in which a more egalitarian world emerges out of technical progress. Both democracy and genAI are unaccomplished projects, risky works in progress. Instead of simply lamenting genAI's "use for propaganda, spread of disinformation, perpetuation of discriminatory stereotypes, and challenges to authorship, authenticity, originality", we should see it as an opportunity to situate "the aestheticization of politics within democracy itself (Benjamin 2007; Park 2024)".[3] In short, we think that the relation between democracy and genAI should not be framed as one of impact (where democracy pre-exists as a fully achieved project on which AI impacts), but as one where democracy is still to come; and, in reverse, where AI is not a fully formed entity that should simply be governed, but has still to be made. In this relation, both entities inform each other. Diffusion as a transversal concept is a device to identify key elements of this mutual "enactment". They pertain to different dimensions of experience, sociality, technology and finance; to different levels of logistics and different scales. The dialectics of diffusion and stability we have tried to characterize is therefore marked by loosely coordinated strategies that include (in no particular order):
- providing a concrete resource, the model's weights, without fee (democracy as equal access to resources)
- producing and disseminating different forms of knowledge about AI: papers, code, tutorials (democratization of knowledge)
- offering different levels of engagement: as a user of a service, as a dataset curator, as a LoRA creator, as a Stable Horde node manager (democratization as increase of participation)
- freedom of use in the sense that the platform's censorship is either absent or can be bypassed locally (democracy as (individual) freedom of expression)
How does it relate to autonomous infrastructure?
Stable Horde can be seen as a paradigmatic example of a strategy that deals with the dialectics of stability and diffusion at different levels of experience, governance and operationality. (Discuss) // I wonder if this is necessary as a separate heading? //
[1] Joanna Zylinska, “Diffused Seeing: The Epistemological Challenge of Generative AI,” Media Theory 8, no. 1 (2024): 229–258, https://doi.org/10.70064/mt.v8i1.1075.
[2] Russakovsky et al. // MISSING REFERENCE //
[3] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” arXiv preprint arXiv:1503.03585 (2015), https://arxiv.org/abs/1503.03585.
[4] Wikipedia, s.v. “Diffusion,” last modified August 12, 2025, https://en.wikipedia.org/wiki/Diffusion.
[5] Sohl-Dickstein et al., “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.”
[6] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” arXiv preprint arXiv:2112.10752, last revised April 13, 2022, https://arxiv.org/abs/2112.10752.
[7] Rombach et al., “High-Resolution Image Synthesis.”
[8] Zylinska, “Diffused Seeing.”
[9] Wiggers 2022 // MISSING REFERENCE //
[10] Jülich Supercomputing Centre, JUWELS Booster Overview, accessed August 12, 2025, https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html.
[11] Pictet group advertises its services with the rhetoric of stability, a response to global instability // MISSING REFERENCE //
[12] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye, DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation, arXiv preprint arXiv:2110.02711 (2021), https://doi.org/10.48550/arXiv.2110.02711.
References
- ↑ For a discussion of the difference between privileged instants and “instants quelconques” see Deleuze’s theory of cinema, in particular https://www.webdeleuze.com/textes/295 (find translation)
- ↑ Open Infrastructure article
- ↑ 3.0 3.1 Discussed here in details http://dx.doi.org/10.1007/s00146-024-02102-y