Dataset

In the context of AI image generation, a dataset is a collection of image-text pairs (sometimes accompanied by other attributes such as provenance or an aesthetic score) used to train AI models. Iconic datasets include the LAION aesthetic dataset, ArtEmis, ImageNet, and Common Objects in Context (COCO). These collections of images, mostly sourced from the internet, reach dizzying scales. ImageNet became famous for its 14 million images in the first decade of the century. Today, LAION-5B consists of 5.85 billion CLIP-filtered image-text pairs.
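To make the idea of an image-text pair with extra attributes more concrete, here is a minimal sketch in Python. It is not the actual LAION pipeline. The record fields, the `filter_pairs` helper, the example URLs, and the threshold values are all assumptions chosen for illustration, and the similarity and aesthetic scores are taken as already computed by some other model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImageTextPair:
    """One dataset entry. Field names are hypothetical, for illustration only."""
    url: str                                 # provenance: where the image was found
    caption: str                             # text paired with the image
    similarity: float                        # assumed precomputed image-text similarity
    aesthetic_score: Optional[float] = None  # optional attribute, may be missing


def filter_pairs(
    pairs: List[ImageTextPair],
    min_similarity: float = 0.28,            # illustrative threshold, an assumption
    min_aesthetic: Optional[float] = None,
) -> List[ImageTextPair]:
    """Keep pairs whose caption matches the image well enough and, if an
    aesthetic threshold is given, whose aesthetic score reaches it."""
    kept = []
    for pair in pairs:
        if pair.similarity < min_similarity:
            continue  # caption and image are judged too loosely related
        if min_aesthetic is not None:
            if pair.aesthetic_score is None or pair.aesthetic_score < min_aesthetic:
                continue  # unscored or below the aesthetic threshold
        kept.append(pair)
    return kept


# Toy data with made-up scores.
pairs = [
    ImageTextPair("https://example.org/a.jpg", "a red bicycle against a wall", 0.34, 6.1),
    ImageTextPair("https://example.org/b.jpg", "click here to buy", 0.12, 4.0),
    ImageTextPair("https://example.org/c.jpg", "portrait of a woman, oil on canvas", 0.31, 4.8),
]

matched = filter_pairs(pairs)
aesthetic_subset = filter_pairs(pairs, min_aesthetic=5.0)
```

Even in this toy form, each threshold is a curatorial decision: raising or lowering it changes which images, and therefore which parts of visual culture, end up in the dataset.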

While large models such as Stable Diffusion require large-scale datasets, components such as LoRAs, VAEs, refiners, or upscalers can be trained on far more modest amounts of data. In practice, this means that a custom dataset is created for each of these components. As each of these datasets reflects a particular aspect of visual culture, the components trained on them function as conduits for imaginaries and world views. Image generators are not simply produced through mathematics and statistics; they are programmed by images. Programming by images is a specific curatorial practice that draws on a wide range of skills: deep knowledge of the relevant visual domain, the ability to find the best exemplars, practical skills such as scraping, filtering, cleaning, and cropping images, and mastery of the art of coherent classification and annotation. In our tour, we discuss two examples of curatorial practices of different scales and purposes: the creation of the LAION dataset and the art of collecting the images needed to "bake the LoRA cake".