LAION
If our tour has led us to well-funded companies such as Hugging Face or CivitAI, with their attachments in the heart of venture capital, it also leads us, at the opposite end of the financial spectrum, to significant actors that operate within a minimal economy, such as Stable Horde. The Large-scale Artificial Intelligence Open Network (LAION) fits in this category. It is a non-profit organization whose ambition is to democratize AI by encouraging public education and the re-use of existing datasets and models. LAION runs on small donations, some in the form of money but most in the form of cloud compute. [1]
LAION's co-founder, Christoph Schuhmann, is the driving force behind one major object of necessity for the generative AI ecosystem: a series of datasets that outscaled what was previously available. The curatorial method for these datasets was entirely automated. It cleverly leveraged available resources such as Common Crawl and Google Colab to download text-image pairs en masse from the internet. This curatorial method differs radically from the practice of affective involvement discussed in the LoRA entry, where anime enthusiasts select images by hand from a visual domain they cherish. In the case of LAION-5B, which contains 5.85 billion image-text pairs, the work of annotation is delegated to the then just-released CLIP model, tasked with verifying the relation between the downloaded images and the adjacent alt-text used as their description. The contrast is even more striking with a subsequent dataset, LAION-Aesthetics, a subset of the 5-billion-image dataset that contains images of higher aesthetic quality. This object of high interest for the burgeoning field of image generation, which was desperately looking for stylistically rich images to train algorithms, was assembled using an approach that again favoured full automation. This time the selection was handled by a custom-made model trained on CLIP embeddings to evaluate the quality of images by assigning each of them an aesthetic score.
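To make the automated verification step more concrete, the following is a minimal sketch of similarity-based filtering, not LAION's actual pipeline. It assumes that embeddings for each image and its alt-text have already been computed by a model such as CLIP; the toy three-dimensional vectors, the field names, the `filter_pairs` function and the threshold value of 0.3 are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=0.3):
    """Keep only the pairs whose image and text embeddings are similar enough.

    The threshold is an illustrative assumption, not a documented setting.
    """
    kept = []
    for pair in pairs:
        score = cosine_similarity(pair["image_embedding"], pair["text_embedding"])
        if score >= threshold:
            kept.append({**pair, "similarity": score})
    return kept


# Hypothetical records: real embeddings would have hundreds of dimensions.
pairs = [
    {"url": "https://example.org/cat.jpg", "alt": "a cat on a sofa",
     "image_embedding": [0.9, 0.1, 0.0], "text_embedding": [0.8, 0.2, 0.1]},
    {"url": "https://example.org/banner.jpg", "alt": "click here",
     "image_embedding": [0.0, 1.0, 0.0], "text_embedding": [1.0, 0.0, 0.0]},
]

kept = filter_pairs(pairs)
print([p["alt"] for p in kept])
```

In this sketch no human looks at any image: a pair survives or disappears solely on the basis of a number produced by a model, which is the sense in which the work of annotation is delegated.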
This reliance on automation can be explained by the fact that LAION operates on a minimal budget and could not afford the cost of manually verifying and annotating a dataset of that scale. But in the case of LAION, the automation of curation did not preclude artisanal practice. It displaced it. An interview given by Schuhmann shows the ad hoc and low-tech nature of the bricolage that presided over the creation of an object that helped spark the development of image generation:
“Then in the spring of 2021, I sat down and just wrote down a huge spaghetti code in a Google Colab and then asked around on Discord who wanted to help me with it. Someone got in touch, who later turned out to be only 15 at the time. And he wrote a tracker, basically a server that manages lots of colabs, each of which gets a small job, extracts a gigabyte, and then uploads the results. At that time, the first version was still using Google Drive.” (Ibid)
“We then did a [blog post about our dataset](https://laion.ai/blog/laion-400-open-dataset/), and after less than an hour, I already had an email from the Hugging Face people wanting to support us. I had then posted on the Discord server that if we had $5,000, we could probably create a billion image-text pairs. Shortly after, someone already agreed to pay that: “If it’s so little, I’ll pay it.” At some point, it turned out that the person had his own startup in text-to-image generation, and later he became the chief engineer of Midjourney.” (Ibid)
In these two fragments, Schuhmann traces a line that runs from managing the limits of user accounts on Colab and Google Drive, through the informality of meeting a coder on Discord who turns out to be a teenager, to an encounter with the future chief engineer of a major company in the field. These anecdotes indicate how the dataset functions as an attractor for actors and projects of radically different scales and funding.
On the question of research versus copyright, Schuhmann explains: “There is a Data Mining Law, an EU-wide exception to copyright. It allows non-profit institutions, such as universities, but also associations like ours, whose focus is on research and who make their results publicly available, to download and analyse things that are openly available on the internet.
We are allowed to temporarily store the links, texts, whatever, and when we no longer need them for research, we have to delete them. This law explicitly allows data mining for research, and that is very good.” (Ibid)