Exploring the Images Used to Train Stable Diffusion’s AI

Andy Baio and Simon Willison looked through 12 million images and made a data browser you can use yourself.

To generate accurate pictures based on prompts, the text-to-image AI model Stable Diffusion was trained on 2.3 billion images. Andy Baio, with help from Simon Willison, discovered what some of them are and even created a data browser so you can try it yourself.

The duo took the data for over 12 million images used to train Stable Diffusion and found out how this dataset was collected, the websites it most frequently pulled images from, and the artists, famous faces, and fictional characters most frequently found in the data.

Stable Diffusion was trained on three datasets collected by LAION, whose image datasets are built from Common Crawl, "a nonprofit that scrapes billions of webpages monthly and releases them as massive datasets. LAION collected all HTML image tags that had alt-text attributes, classified the resulting 5 billion image-pairs based on their language, and then filtered the results into separate datasets using their resolution, a predicted likelihood of having a watermark, and their predicted 'aesthetic' score."

For the research, the authors took LAION-Aesthetics v2 6+, which includes 12 million image-text pairs with a predicted aesthetic score of 6 or higher.
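To make the filtering step concrete, here is a minimal Python sketch of how records could be narrowed down by the criteria described above. It is not LAION's actual pipeline: the `ImageRecord` fields, the `filter_records` helper, and the watermark and resolution thresholds are assumptions chosen for illustration. Only the aesthetic cutoff of 6 comes from the article.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ImageRecord:
    """Hypothetical record; field names are illustrative, not LAION's schema."""
    url: str
    caption: str
    width: int
    height: int
    pwatermark: float  # predicted likelihood of a watermark, 0 to 1
    aesthetic: float   # predicted aesthetic score


def filter_records(
    records: Iterable[ImageRecord],
    min_aesthetic: float = 6.0,   # the "6+" cutoff mentioned in the article
    max_pwatermark: float = 0.8,  # assumed threshold, for illustration only
    min_side: int = 512,          # assumed minimum resolution, for illustration only
) -> List[ImageRecord]:
    """Keep records that pass the aesthetic, watermark, and resolution checks."""
    return [
        r for r in records
        if r.aesthetic >= min_aesthetic
        and r.pwatermark < max_pwatermark
        and min(r.width, r.height) >= min_side
    ]
```

Calling `filter_records(records)` on a list of such records would return only those with a score of 6 or higher that also pass the two assumed checks.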

The research showed that almost half of the images were sourced from only 100 domains, with the largest number of images coming from Pinterest. Other sources include WordPress-hosted blogs, Smugmug, Blogspot, Flickr, DeviantArt, Wikimedia, 500px, and Tumblr. Shopping sites were also well-represented.
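A tally like this can be approximated by counting the host of each image URL. The Python sketch below is an assumption about how one might do it, not the authors' method: `domain_counts` and `top_share` are hypothetical helpers, and the code groups by full hostname (so `i.pinimg.com` and `pinimg.com` would count separately), which is a simplification.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse


def domain_counts(urls: Iterable[str]) -> Counter:
    """Count images per hostname; URLs without a hostname are skipped."""
    counts: Counter = Counter()
    for url in urls:
        host = urlparse(url).hostname  # lowercased by urlparse, None if absent
        if host:
            counts[host] += 1
    return counts


def top_share(counts: Counter, n: int) -> float:
    """Fraction of all counted images that come from the n most common hosts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total
```

With the full list of image URLs, `top_share(domain_counts(urls), 100)` would give the share of images coming from the 100 most common hosts.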

Using the list of over 1,800 artists in MisterRuffian’s Latent Artist & Modifier Encyclopedia to search the dataset, the authors found out that the most frequently referenced artist is Thomas Kinkade, followed by Vincent van Gogh, Leonid Afremov, and Claude Monet.

To see how well-represented celebrities are in the dataset, Baio and Willison took two lists of famous people and merged them into a list of nearly 2,000 names. It turns out that Donald Trump is one of the most cited names in the image dataset, with Charlize Theron a close runner-up. The authors note that popular internet personalities don’t appear in the captions from the dataset, which might mean the Common Crawl data was too old to include them.

Fictional characters from the MCU, like Captain Marvel, Black Panther, and Captain America, are some of the best represented in the dataset, according to the research.

Interestingly, NSFW content is present in the image dataset, but there's not a lot of it. "The Stable Diffusion team built a predictor for adult material and assigned every image a NSFW probability score, which you can see in the “punsafe” field in the images table, ranging from 0 to 1."

"Only 222 images got a “1” unsafe probability score, indicating 100% confidence that it’s unsafe, about 0.002% of the total images — and those are definitely porn. But nudity seems to be unusual outside of that confidence level: even images with a 0.9999 punsafe score (99.99% confidence) rarely have nudity in them."
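The percentage is easy to check, and the same kind of count can be run against any SQLite copy of the data. The Python sketch below assumes a table named `images` with a `punsafe` column, as the quote describes; the helper names are hypothetical, and the total of exactly 12 million is a round-number assumption since the article only says "over 12 million".

```python
import sqlite3


def count_at_least(conn: sqlite3.Connection, threshold: float) -> int:
    """Count rows in the assumed `images` table with punsafe >= threshold."""
    row = conn.execute(
        "SELECT COUNT(*) FROM images WHERE punsafe >= ?", (threshold,)
    ).fetchone()
    return row[0]


def percent(part: int, total: int) -> float:
    """Express part as a percentage of total."""
    return 100.0 * part / total


# 222 images out of an assumed 12,000,000 is roughly 0.002 percent.
share = percent(222, 12_000_000)
```

Here `share` works out to about 0.00185, which rounds to the 0.002% quoted above.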

The authors admit that the filtering on aesthetic ratings could be removing large amounts of NSFW content from the image dataset.

You can find the full report here if you're interested in more detailed results.
