Stable Diffusion 2.0 & 2.1: An Overview
Stability AI moves fast and breaks things
Before diving into this, you should read my previous guide to Stable Diffusion if you haven’t already. Also, I was going to cover the Artstation protest and other controversies around Stable Diffusion in this post, but that will have to wait for the next installment of my “AI wars” series (Part 1, Part 2). So look for that, soon.
To receive new posts and support my work, consider becoming a free or paid subscriber.
On November 24, 2022, Stability AI released the 2.0 version of Stable Diffusion. Then just two weeks later, they pushed out version 2.1. The short span of time between 2.0 and 2.1 wasn’t solely because the company is trying to iterate faster. Rather, 2.0 was widely criticized by its large community of users that had embraced the initial release of Stable Diffusion 1.5.
Many of these users, myself included, fiddled around with 2.1 and found that none of our prompts were giving us the result we wanted, then went right back to using Stable Diffusion 1.5. The subsequent 2.1 release tried to address some of the criticisms, but in the view of many, it didn’t go far enough.
In a nutshell, Stable Diffusion 2.0 and 2.1 have the following things working against them, some of which are just the inevitable growing pains of innovation but others of which are the result of the AI wars that we’re just in the opening battles of.
A new text language interface that subtly invalidated most users’ intuitions about how to work with the model.
A totally remixed embedding space that effectively scrapped a massive, crowdsourced map of 1.5’s higher-dimensional latent space.
More guardrails against generating NSFW content.
The result of the above additions in the 2.x versions is that individuals and communities that had gotten good at working with Stable Diffusion 1.5 suddenly had to start all over again from scratch. This reset has not been popular with users, but for those who’ve stuck it out, the new version offers a number of notable upgrades that make it worth adding to your toolbox alongside 1.5.
In this update, I’ll walk you through what 2.0 tried to do and where the community felt it failed. I’ll also talk about 2.1’s attempts to address some of the criticisms of 2.0. Finally, I’ll update my recommendation on the best way to run Stable Diffusion locally.
New feature: Depth-to-image
The most powerful new feature of Stable Diffusion 2.x is called depth-to-Image. This feature is not a separate feature from the existing image-to-image feature but rather is incorporated as an enhancement into the older feature.
What it is and how it works
Depth-to-image takes a two-dimensional image and attempts to infer a depth map from it. It then uses this depth map to generate a new image from the prompt that’s built around the basic 3D structure of the input image.
You can see this in action in the tweet above, but let me unpack it further with an analogy.
Imagine a piece of software that can look at a satellite photograph of a patch of forested land and then construct a topographical map from that photo based on a combination of subtle visual cues like shadows and known information like the average heights of certain types of trees. That’s the depth map part of the process.
Then, imagine a second piece of software that can take the 3D topographical map output in the preceding step, and then use a text prompt to re-skin it so that it looks like an entirely different kind of terrain — maybe desert, tundra, or urban sprawl. That’s second part is the image generation piece.
Stable Diffusion 2.x uses a machine learning model called MiDaS that’s trained on a combination of 2D and 3D image data — in particular, it was trained using a 3D movies dataset containing pairs of stereoscopic images. With this training, the model can look at a 2D scene and then back into a 3D representation of it.
This 3D representation, then, can form the basis of an image-to-image generation step that preserves the 3D features of the input image while giving it different characteristics.
What you can do with it
In SD 1.5, the image-to-image feature would mostly preserve colors and shapes from the original, so this worked well in some cases and terribly in others. I mostly found it useful for turning crude 2D art into something less crude, but I’ve seen AI artists do really amazing things with it in some private Facebook groups. It’s a lot of work to go from a crude Procreate drawing to something really nice because it requires multiple iterations of taking the output of the previous image-to-image step and feeding back into the model with the same prompt and tweaked parameters. What you often end up with is something that’s only very loosely based on the input drawing.
SD 2.x’s depth model is a major upgrade to this experience, and it opens up new possibilities for one-pass photo retouching. In my first attempts to play with it, I added a beard to a photo of Keanu Reeves. This was pretty painful to look at, though — the face was all wrong.
I restarted my efforts with the image below of a 3D model, which I found on Google Images.
After an hour and a half of fiddling with parameters and prompts, I was finally able to get a really nice Batman from the source image:
I then generated the following superheroes by just swapping the name “Batman” out of the prompt for other hero names:
There was no need for multiple loops of image-to-image runs, and the three-dimensional aspects of the original — the lighting, pose, musculature, etc. — are all very well-preserved.
Here are the params I used for all of these, which you should be able to use to recreate them:
Prompt: Batman standing against a black background, trending on artstation, sharp focus, studio photo, intricate details, soft natural volumetric cinematic perfect light, highly detailed, unreal 3D, by greg rutkowski
Negative Prompt: cartoon, ugly, unrealistic, no face, no hands, messed up hands, too many fingers, mist, missing face, messed up face, no mouth
Sampling Steps: 40
Sampling Method: DPM2 a Karras
Size: 976 x 512
CFG Scale: 18.5
Denoising Strength:: 0.71
I tried to tweak the prompt to spice up the background, and it worked well enough to show that the model understands the scene well enough to separate the background from the foreground while preserving the lighting on the foreground figure:
Notice how the generated images retain so many features of the original 3D model image. The lighting comes from the same directions, and the musculature and pose are largely preserved — the superhero images are recognizably derivative of the original scene. And again, it just takes one pass to do this — not multiple, progressive iterations as in SD 1.5.
It was simply not possible to do any of the above with SD 1.5, so the new version’s depth model is a true advance in the state-of-the-art. Even if you can’t use SD 2.x to generate the kind of artwork you’re used to getting from SD 1.5, you owe it to yourself to spend time learning to use depth-to-image and developing it as a tool in your toolbox.
New feature: OpenCLIP
Stable Diffusion 1.0 used OpenAI’s CLIP model for interpreting text prompts, but 2.x switches over to a newer model called OpenCLIP that brings a number of improvements, but also is the source of some complaints.
What it is and how it works
Diffusion models like DALL-E and Stable Diffusion use a CLIP (Contrastive Language–Image Pre-training) model to connect text prompts with points in latent space. In Stable Diffusion, the CLIP model is used in a specific way: it translates text prompts into embeddings, or coordinates that are lower-dimensional representations of points in the latent space of the diffusion model.
A more accessible way to describe the CLIP model would be to say that it’s responsible for translating natural language words into concepts that the diffusion model has learned and can work with. So if I give the clip model the prompt, “cat on a hot tin roof,” then its job is to take that string of text and locate the regions in latent space that correspond to concepts like “cat,” “roof,” “on top of,” “tin as a building material,” etc.
Both the OpenCLIP model and its predecessor are open-source, but only OpenCLIP was trained on an open database of images. This open database has a few important differences from the proprietary database OpenAI used to train the CLIP model used in SD 1.5:
It was explicitly and aggressively filtered for NSFW images
It contains fewer celebrities and named artists
Some terms that dominated CLIP to an unreasonable degree (“Greg Rutkowski,” “trending on artstation”) are now either absent or their dominance is greatly reduced.
All of the above have caused endless complaints from Stable Diffusion users that the new model was “nerfed” or otherwise wrecked. There have also been rumors and misunderstandings to the effect that some artists and art styles were deliberately removed from the OpenCLIP training data. More on this below, though.
How it differs from the previous version
I think the “nerfed” complaint has some merit, mainly because the deliberate removal of NSFW imagery really degraded the quality of the output in 2.0 and made it too hard to get good images of humans. Here’s how Stability AI described the issue in their 2.1 release post:
When we set out to train SD 2 we worked hard to give the model a much more diverse and wide-ranging dataset and we filtered it for adult content using LAION’s NSFW filter. The dataset delivered a big jump in image quality when it came to architecture, interior design, wildlife, and landscape scenes. But the filter dramatically cut down on the number of people in the dataset and that meant folks had to work harder to get similar results generating people.
The 2.1 version featured a fine-tuned iteration of OpenCLIP (not completely new weights from a new training run) with a set of much more relaxed NSFW filters in place. So 2.1 gives better results on pictures of people — in particular hands and other complex anatomy — than its immediate predecessor did.
As far as the rumor that artists (like Rutkowsi) who objected to being included in 1.5 had their work deliberately removed, that is false. There was just a different dataset with different contents, which makes for different mappings of text strings to latent space. There was less art in the new dataset, though, so the 2.x models aren’t as good at doing a number of art styles.
As for the improvements, Stability AI claims the quality of the generated images is much better than what was possible with the previous CLIP model. This is quality boost difficult to evaluate, though, unless you’re looking at specific types of output. In their blog post showing off the model, Stability AI shows some photographic-quality output of the type that was difficult or impossible to generate with the SD 1.5. (DALL-E was famously better at photographic output, while SD was better at artistic output.)
It also looks likely that the new CLIP model is better at giving results that match the prompt, given the model’s improvements on certain ImageNET captioning benchmarks.
In my own informal testing, I saw both of these improvements — improved quality and better mapping of text to concepts. As an example of the quality improvement, check out the response of the two models to the simple prompt,
a rabbit in Times Square, with the same parameters:
You can see that the road and other features of the 2.1 version’s output are higher quality and have fewer weird artifacts. The results are just more realistic and less odd.
As for the concepts, I tried a harder prompt,
An astronaut playing golf on a golf course to see what the two models could do with it.
The 1.5 version’s images are better quality, but there’s no astronaut in there. In contrast, 2.1’s images were more varied in style and less photographic but nailed the astronaut component in three of the four. I did multiple runs of this prompt on both models, and the results above are quite representative — 1.5 never really got the astronaut (in some runs you might get a golfer with a helmet-like thing on his head), but 2.1 almost always got an astronaut in there.
More new OpenCLIP features: higher resolutions & negative prompts
Stable Diffusion’s original CLIP model could only output 512x512 images — you could do higher-resolution images with Stable Diffusion 1.5, but you had to use an add-on upscaler on the 512x512 output.
Version 2.x’s OpenCLIP model now supports natively generating both 512x512 and 768x768 image sizes. Here’s the thing, though: you really have to use the new 768x768 size to get decent results. Most of the initial hour and a half I spent struggling with depth-to-image and not getting good results was entirely down to the fact that I was using the 512x512 image size. The moment I bumped up the resolution above 768x768, the results improved instantly and dramatically. From there, it took me only a few minutes to iterate around to the samples I showed in the previous section.
There’s also a new Upscaler Diffusion model built into 2.x that can take images in either of the aforementioned native sizes and upscale to much higher resolutions.
Finally, version 2.1 adds the ability to generate resolutions of arbitrary aspect ratios, like Twitter’s 1500x500 banner image size. So you don’t need to either settle for square output or pass it to another upscaler.
Negative prompts and prompt weighting
One of the biggest changes OpenCLIP brings to the way SD 2.x works is the new emphasis on negative prompts for getting good results. While a normal prompt tells the model what you want to see in the image, negative prompts tell it what you don’t want.
To use the prompt-as-search framework from an earlier article on generative AI, you can think of negative prompts like the exclusion operator in Google searches, i.e., the minus sign that tells Google that you want results that do not have the excluded term.
Stable Diffusion has always supported negative prompts, but now they’re even more effective and important. Specifically, negative prompts are the key to fixing some of the well-known problems with Stable Diffusion, in particular, the mangled hands issue. You can often get better results for hands by including things like
too many fingers or
ugly hands as negative prompts.
Stability AI claims negative prompts “work even better” in 2.1 than they do in 2.0, but I haven’t played around with it enough to quantify what exactly they mean by that.
Running Stable Diffusion 2.1 locally
If you’re looking to upgrade your skills and you really want to get the most out of Stable Diffusion 2.1, especially when it comes to working with depth-to-image, then right now it’s best to run it locally. Not all the cloud services you might end up using support all the features and options.
I used to recommend a local app called DiffusionBee for the Mac, but at this point, the popular Automatic1111 WebUI repo is easy enough to use that I now recommend it, instead. If you have even very basic command-line skills — i.e., you can use the homebrew package manager to install software — then this is very easy to get up and going.
If you don’t have such skills, it’s probably best to just keep using a cloud service and skip the local install and keep doing it online. It’s going to be faster anyway unless you have a ton of RAM and a pretty robust machine.
You might think of running SD locally as the “buying a DLSR camera” of image generation, whereas a cloud service is going to give you a more “point and shoot” level of experience. You can still get great results with the cloud services, but a local install will give you much more freedom to experiment, especially with the latest features.
To sum up, Stable Diffusion 2.0 brings the following changes from 1.5:
A new training dataset that features fewer artists and far less NSFW material, and radically changes which prompts have what effects.
Poorer rendering of humans, due to the aforementioned NSFW filters
Fewer celebrities and artists in the dataset, meaning it performs more poorly on celebrities and many art styles.
Better rendering of photorealistic output, as well as architecture, building interiors, and outdoor scenes.
Better mapping of prompts to concepts
Higher resolution output
Better use of negative prompts
Stable Diffusion 2.1 adds the following upgrades to the above list:
Arbitrary resolution output directly from the model using a built-in upscaler.
Even better use of negative prompts
Restores more of 1.5’s ability to render people
In all, I’m glad 2.1 exists, but I do wish they’d release another set version that features OpenCLIP trained on the OpenAI dataset. This is probably not possible, though, since the OpenAI dataset is proprietary, so those who want the best of both worlds will have to use both 1.5 and 2.1 versions alongside each other for different purposes.
Finally, I’m a bit worried about the next version of Stable Diffusion, since it’s the first to feature an artist opt-out mechanism. There are a number of pros and cons to this that I’ll go into in a follow-up piece on the recent Arstation protests, but for now, I’d say most people should expect to keep using SD 1.5 for artistic output until we end up in the world described in the following thread (it’s not far off):
Postscript: Breaking changes
The broadly negative reaction to Stability AI’s replacement of the original CLIP model with a new OpenCLIP highlights a major problem with the pace of innovation in this space: every time a new set of model weights is published, it’s essentially a breaking change.
Wikipedia defines “breaking change” as, “A change in one part of a software system that potentially causes other components to fail; occurs most often in shared libraries of code used by multiple applications.” This pretty well describes what happened with SD 2.x.
The fact that the CLIP model is the text-based interface for latent space means that swapping in a new CLIP model amounts to a major UI overhaul that zeroes out a lot of the community’s distributed, crowdsourced knowledge about how to get the best images out of the model.
To understand just how disruptive the new CLIP model was and how much existing value it gave up in exchange for its new features, it’s best to think of the community response to the original Stable Diffusion launch as a kind of distributed, crowdsourced project to create a map of the model’s latent space.
There was a massive amount of dedicated iteration and trial-and-error that went into the development of prompt formulas and so-called “filters” — bits of stored text that could be appended to a user-supplied prompt in order to steer the output into a region of latent space that had a particular look or set of qualities.
This group project took place on Reddit, Twitter, YouTube, Discord, and other social venues where generative AI enthusiasts gathered to share tips and tricks for getting the best out of Stable Diffusion. A number of social AI sites like PlaygroundAI and PromptHero sprung up for sharing prompts, and this turbocharged progress in this area.
In all, it’s impossible to quantify the number of person-hours of effort spent mapping the latent space of Stable Diffusion 1.5, but it’s surely one of the largest crowdsourced projects since Wikipedia.
All of this work goes out the window, though, when when the embeddings and mappings that give access to the model’s latent space change.
Indeed, I’ve been keeping an eye on PlaygroundAI and I’m just not seeing that many users of Stable Diffusion 2.1 in the mix. There are also only three filters available for it, which suggests that it’s just not getting much attention. Meanwhile, the number of filters for Stable Diffusion 1.5 continues to grow. I take this as indicative of a general community rejection of 2.x. I am seeing people working with SD 2.1 on Reddit but in general, the pace of discovery there just feels much slower than it was with SD 1.5 — and again, there’s still a lot of activity around SD 1.5.