Scott Lighthiser showed how the creature was made in Stable Diffusion and EbSynth, explained how the images were upscaled, and talked about the future of the technology.
Introduction
I'm Scott Lighthiser, a Writer and Director from Michigan. I studied Production at Lansing Community College and have been experimenting on my own with filmmaking techniques since then.
Making Creatures with Stable Diffusion
I've been trying to build on my experience with cinematography, and I'm always looking for visual references and inspiration. When I got into the beta testing phase of Stable Diffusion, I was able to create a lot of really imaginative cinematic imagery.
My latest creature test was to see if I could bring this type of photoreal character to life.
Stable Diffusion is also able to take an initial image and create a new image from it. For the shoot, I set up some plastic sheets as a large diffused backlit backdrop, then put a light on one side shining through more plastic for a soft key, with a bounce on the other side. I got in front of the camera and started making weird creature motions.
After some simple color grading, I used Blender to render the footage out as .png sequences. Then I chose a keyframe; here's a good one:
I uploaded this to Imgur and pasted the direct image URL into the initial image field in NOP & WAS's Stable Diffusion Google Colab. It should be said that Stable Diffusion has constraints on what aspect ratios you can generate, so you have to do a lot of resizing to make sure the ratios match. For instance, I usually work in 1890x1080 instead of 1920x1080, for now at least, but I think segmenting what you plan to edit and then compositing it back into the frame would maintain quality a little better. There's a slider to vary how much strength your own image has over the final output; play around with this, but you're still going to want your main features to line up in the image. You can always edit it in Photoshop though. For my prompt, I did a subject and then loads of clarifying and artistic keywords. One of my outputs was this:
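To give a rough idea of what that img2img step looks like when scripted, here's a sketch using the open-source diffusers library rather than the Colab notebook; the model name, sizes, strength value, prompt, and file names are placeholders, not my exact settings:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load an img2img pipeline (model checkpoint here is an assumption, not the Colab's exact one).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Resize the keyframe to dimensions Stable Diffusion can generate (multiples of 64).
init_image = Image.open("keyframe_0001.png").convert("RGB").resize((896, 512))

# Subject first, then clarifying and artistic keywords (illustrative prompt).
prompt = "a pale gaunt nosferatu-like creature, cinematic lighting, photoreal, film still"

# strength sets how far the model departs from the initial image; lower values
# keep more of the original framing so the main features still line up.
result = pipe(prompt=prompt, image=init_image, strength=0.55, guidance_scale=7.5)
result.images[0].save("creature_keyframe.png")
```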
But this was still at 512x896; generating at this size helps with speed, but there's not enough detail to use. Using a standard upscaler works sometimes, but it also reduces quality in my opinion, like this:
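For context, a "standard" upscale here just means a plain resample that invents no new detail, something like this Pillow sketch (the filenames are placeholders):

```python
from PIL import Image

# A plain 2x resample of the low-resolution output: bigger pixels, no new detail.
img = Image.open("creature_keyframe.png")
upscaled = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)
upscaled.save("creature_keyframe_2x.png")
```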
Instead of that, I've been trying to find a way to mimic what the Midjourney upscaler does. It adds some nice detail and sort of reinterprets the image, from what I can tell at least. This Colab has options to render at a higher size, so I put my 512 output back in with the same prompt and the strength shifted more toward the image. After that, I got this:
There's a lot more detail, and it's at 1024x1792. I'll probably get Colab Pro to push this higher in the future. Then I used Photoshop to make a mask of my original shape, the picture of just me with no edits yet. These are the only pixels that are going to move, so I only want the creature within this area.
Now the creature is in the same physical environment with matched lighting.
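If you wanted to script those two steps instead of doing them by hand, a rough sketch with diffusers and Pillow might look like the following; the model name, file names, sizes, and strength value are all assumptions, and in practice I did the masking in Photoshop:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a pale gaunt nosferatu-like creature, cinematic lighting, photoreal, film still"

# Step 1: enlarge the low-res creature output, then run img2img again with the
# strength pushed toward the image, so the model re-details rather than re-imagines.
low = Image.open("creature_keyframe.png").convert("RGB")
low = low.resize((low.width * 2, low.height * 2), Image.LANCZOS)
hires = pipe(prompt=prompt, image=low, strength=0.3, guidance_scale=7.5).images[0]

# Step 2: composite through a mask of the original silhouette (same size as the
# frame, white where the creature is allowed to exist), so only those pixels change.
original = Image.open("keyframe_0001.png").convert("RGB")
mask = Image.open("silhouette_mask.png").convert("L")
hires = hires.resize(original.size)
composite = Image.composite(hires, original, mask)
composite.save("creature_keyframe_final.png")
```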
A program like EbSynth can take this edited frame and use it as a keyframe, applying its visual information to the motion information of your video. It can't create new information, so if I turned my head, it would just start smearing. Keep this in mind when shooting.
Now, if I wanted this same Nosferatu-looking thing in a different pose, I don't think I could really do that with where the tech is as of now. There is a process called textual inversion where you can train Stable Diffusion on an image or concept to use in various generations. For that, you need 3 to 5 images from different angles, I believe, so that may be difficult in this case. I can see a possibility that when Stable Diffusion generates an image, it tokenizes certain aspects of the generation which could be used again. A "token" in Stable Diffusion terms is a discrete term or concept Stable Diffusion uses to understand prompts.
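As a sketch of how a trained concept gets reused, diffusers can load a learned textual-inversion embedding and expose it as a token you drop into prompts; the token name, embedding path, and prompt below are hypothetical, and the training itself needs its own script and reference images:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a learned embedding and give it a placeholder token for prompts.
pipe.load_textual_inversion("creature_embeddings/learned_embeds.bin", token="<creature>")

# The new token now stands in for the trained concept in any prompt,
# which is how you'd ask for the same creature in a different pose.
image = pipe("a <creature> crawling through a foggy forest, cinematic lighting").images[0]
image.save("creature_new_pose.png")
```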
The Future of the Technology
Initially, this technology will allow artists in the film industry to expand their visual vocabulary and augment their imagination in a huge way. Moving beyond pre-production, imagine being on an LED Volume stage, describing an elaborate backdrop, and then seeing it appear in front of you. With the rate of growth of this technology, I can see that adoption and optimization happening within a few years.
And when visual quality is finally democratized, when anyone can create beautiful images in the blink of an eye, the true value settles where it always has and always will: the ability to tell a story, to connect to the soul of a work, to use both your heart and head to craft an emotional experience for people around the world.