Digital Artist CoffeeVectors has told us about their latest animation experiment with Stable Diffusion and MetaHuman, explained how they generated the input face and set up the character, and discussed using Thin-Plate Spline Motion Model and GFPGAN for face fix/upscale.
Introduction
I currently freelance in the video/photo industry and take on a wide range of clients from around the world, but I usually find myself working with fashion and makeup brands. 3D and computer-generated work has been something I've gotten into over the past few years as I've seen it increasingly used in film, fashion, and media more generally.
I got really curious about exploring how creative work might change and evolve over the next decade and I decided to start expanding my skillset to be ready for those future clients that might have different needs than the ones I have today. I'm using CoffeeVectors as the avatar for my work focused purely on that space.
I'm self-taught, so I've been studying YouTube channels and online courses/resources, but I bring a lot of my production experience and aesthetic sense from my day job into interpreting what I've been absorbing. I haven't yet worked on a project using my CGI skillset exclusively, but I'm hopeful it won't be long before I can contribute some value in that space.
Animation Experiment with Stable Diffusion and MetaHuman
I started on the Stable Diffusion and MetaHuman project the same day I posted the animation, though in truth it was built on top of earlier explorations with the Mesh to MetaHuman plug-in and various experiments with Unreal Engine 5. Essentially, I was trying to see what kind of workflow I could build bridging Epic's MetaHuman with AI image synthesis platforms like Stable Diffusion.
Generating the Input Face with Stable Diffusion
I used something called img2img in Stable Diffusion (other AI platforms have similar methods), where an initial image becomes the starting point for the prompt. There's a noise setting, often called denoising strength, that tells the AI how far it can deviate from the reference image. I set things up so that the result would be very close to the original.
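As a rough illustration, here's a minimal img2img sketch using the Hugging Face diffusers library and the runwayml/stable-diffusion-v1-5 weights (an assumed setup, not necessarily the exact tooling used here); the strength parameter plays the role of that noise setting, and keeping it low keeps the output close to the reference:

```python
# Minimal img2img sketch with Hugging Face diffusers (assumed tooling;
# the prompt and file names are illustrative).
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("metahuman_render.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="portrait of a woman, oil painting, soft studio light",
    image=init_image,
    strength=0.3,          # low strength = stay close to the reference image
    guidance_scale=7.5,    # CFG Scale
    num_inference_steps=50,
).images[0]

result.save("img2img_result.png")
```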
In my experience so far, when you use img2img with restrictive settings, the prompts don't have as much effect on the end product as they usually might because the AI is drawing from multiple inputs. Macro descriptions like "oil painting", "illustration", or "sculpture" have an effect, as do style descriptions like Anime, Art Nouveau, or Cubism. But the underlying image carries a lot of weight.
You can also feed your resulting image back into the AI as a new starting point and cycle through a few generations. With each cycle, you can change the prompt and steer things in slightly different directions. It becomes less about a single initial prompt and more about understanding/modifying the larger system of settings interacting with each other, like how you might track variables in a blueprint in Unreal.
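A hedged sketch of that cycling loop, again assuming diffusers, with illustrative prompts; each pass feeds the previous result back in as the new init image:

```python
# Sketch of cycling an image through several img2img generations,
# changing the prompt each pass. Model choice and prompts are illustrative.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "portrait of a woman, photorealistic, soft studio light",
    "portrait of a woman, art nouveau illustration",
    "portrait of a woman, art nouveau illustration, gold leaf details",
]

image = Image.open("start_image.png").convert("RGB").resize((512, 512))
for i, prompt in enumerate(prompts):
    # Each generation becomes the init image for the next cycle.
    image = pipe(
        prompt=prompt,
        image=image,
        strength=0.35,       # keep each step close to the previous result
        guidance_scale=7.5,
    ).images[0]
    image.save(f"cycle_{i:02d}.png")
```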
For example, the different sampler models you can pick from (DDIM, k_euler, etc.) have a huge impact on the aesthetics. I have an RTX 3090 and run Stable Diffusion locally on my machine. I'm able to run scripts that can walk through values like the CFG Scale, steps, and different sampler models.
From that, I can generate a grid so I can look through a set of options. Many of them will have artifacts, and sometimes completely different interpretations of the prompt, which I then have to sift through to find usable images. Coming from a photo background, I think of it like a contact sheet, searching for selections scattered among options and outtakes.
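The sweep itself can be a short script. Below is a rough sketch of that kind of grid walk, with illustrative scheduler choices and values rather than the exact script described above:

```python
# Walk a few sampler/CFG/step combinations and tile the results into a
# contact-sheet grid. Schedulers, values, and prompt are illustrative.
import torch
from diffusers import (
    StableDiffusionImg2ImgPipeline,
    DDIMScheduler,
    EulerDiscreteScheduler,
)
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
init_image = Image.open("start_image.png").convert("RGB").resize((512, 512))

samplers = {
    "ddim": DDIMScheduler.from_config(pipe.scheduler.config),
    "euler": EulerDiscreteScheduler.from_config(pipe.scheduler.config),
}
cfg_values = [5.0, 7.5, 10.0]
step_values = [20, 40]

tiles = []
for name, scheduler in samplers.items():
    pipe.scheduler = scheduler
    for cfg in cfg_values:
        for steps in step_values:
            tiles.append(pipe(
                prompt="portrait of a woman, oil painting",
                image=init_image,
                strength=0.4,
                guidance_scale=cfg,
                num_inference_steps=steps,
            ).images[0])

# One row per (sampler, CFG) combination, one column per step count.
w, h = tiles[0].size
cols, rows = len(step_values), len(samplers) * len(cfg_values)
grid = Image.new("RGB", (cols * w, rows * h))
for idx, tile in enumerate(tiles):
    grid.paste(tile, ((idx % cols) * w, (idx // cols) * h))
grid.save("contact_sheet.png")
```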
My advice for prompting comes in two parts. One, don't get tunnel vision on the prompts. There's more to a car than the engine: prompts are important, but they're not everything. With platforms like DALL-E 2, where the underlying variables aren't exposed, prompts do play a dominant role.
But with Stable Diffusion and Midjourney, there are more controls available to you that affect the output. If you're not getting what you want from prompts alone in Stable Diffusion, for instance, it could be because you need to shop around the sampler methods and CFG Scale values. Even the starting resolution affects the images you get because it changes the initial noise pattern.
That leads to part two of my advice: it helps to have the right big-picture mindset. You're having a back-and-forth conversation with the AI, trying to find enough common ground to coordinate.
You begin not speaking the same language, not seeing the same things, but over time you hopefully find a way to communicate and converge on an aesthetic or concept. It’s like searching for a Rosetta Stone specific to your project. Or in more AI terms, trying to communicate how to get to the useful regions of the latent space.
Having been on many different kinds of video productions where sometimes English isn’t the primary language, I've learned that being able to negotiate a path to a common vision is an incredibly important skill. And it’s not always people you’re negotiating with, but equipment, the weather, etc. It’s about being in the relationship and navigating it. Maybe it’s your relationship to the AI, maybe it’s the relationship of the AI portion of your workflow with all the other parts of your process.
Using MetaHuman
The setup was basically the Mesh to MetaHuman workflow. I just used a model from Daz Studio that I liked. I think MetaHuman is an incredible tool with so many possibilities, particularly in how it opens the door for so many creatives to make work with high-quality 3D humans. The Matrix Awakens demo just blew me away!
That said, what I like about Daz is the ability to customize and transform the body, and how easy it is to go into something like Marvelous Designer and build custom clothing for custom measurements. I hope in the future we see an extension of Mesh to MetaHuman applied to bodies where I can input a 3D scan or a 3D body from some other source and have it translate into a pre-rigged MetaHuman with all its high-quality materials.
I'm particularly interested in where MetaHuman fits into the future of virtual production. And of course, there's the huge potential with animated shorts and cinematics, particularly if we can drive final looks with something we synthesized in AI and then polish or add layers of art direction in Photoshop.
Using Thin-Plate Spline Motion Model & GFPGAN for Face Fix/Upscale
I'm not sure how Thin-Plate Spline Motion Model and GFPGAN work in deep technical detail. I've read the research papers and a lot of it goes over my head, though I'm learning more and more about coding and machine learning, so I hope one day these papers will make more sense.
That said, Thin-Plate is essentially a motion transfer model that uses a driving video to animate deformations in a target image. This video was actually the first time I'd used it, and I was pretty shocked by how well it handled information that wasn't present in the reference image, particularly blinking and the face turning.
GFPGAN is one of the go-to models for addressing face artifacts that can appear in AI-generated images. With eyes, for example, you rarely get circular irises directly from the AI, and occasionally the nose and lips have odd shapes. GFPGAN can improve those issues, and it also works as an upscaler. It's usually applied to still images, but if you run the model locally you can batch process an image sequence. Thin-Plate outputs a 256x256 video, so GFPGAN helps increase the sharpness and resolution in addition to fixing facial artifacts.
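A rough sketch of what that batch pass might look like, assuming the GFPGAN Python package and a locally downloaded GFPGANv1.3.pth checkpoint (argument names can vary between versions), with hypothetical frame folders:

```python
# Batch-run GFPGAN over an image sequence exported from Thin-Plate Spline
# Motion Model. Paths, checkpoint, and upscale factor are illustrative.
import glob
import os
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="GFPGANv1.3.pth",
    upscale=4,               # lift the 256x256 Thin-Plate output to 1024x1024
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,
)

os.makedirs("restored_frames", exist_ok=True)
for path in sorted(glob.glob("thin_plate_frames/*.png")):
    frame = cv2.imread(path)
    # enhance() returns cropped faces, restored faces, and the full restored frame
    _, _, restored = restorer.enhance(
        frame, has_aligned=False, only_center_face=False, paste_back=True
    )
    cv2.imwrite(os.path.join("restored_frames", os.path.basename(path)), restored)
```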
Limitations When Using the Thin-Plate Spline Motion Model & GFPGAN Approach
I'd be very curious to see how robust and coherent Thin-Plate, or something like it, could be. I know there's some really interesting research going on at NVIDIA with face frontalization and synthesizing novel views.
I'm curious if we can eventually get a 360-degree view of a head generated from a single keyframe while keeping temporal coherence and high consistency. How far can we push turning a 2D image into a moving 3D representation? How could that improve the workflow for revisions or generating variants? Can we generate assets that are more ready-to-go directly from concept art, which professional artists can then further adjust as needed?
I think as we iterate and improve these AI tools and learn how to integrate them into the current-day process, small indie studios and one-person teams might be empowered to extend their existing artistic talents into even more powerful workflows that give them back time while retaining or even improving quality.
Using the Thin-Plate Spline Motion Model & GFPGAN Approach in Game Dev
Thin-Plate Spline Motion Model and GFPGAN potentially have great utility for general-purpose pre-production across industries. Being able to create fast, adaptive, and fairly detailed imagery can help teams and clients find common ground and communicate more quickly and effectively. I think of it as the difference between watching video playback and waiting for dailies to be developed on film stock. New layers and structures of coordination can open up by shrinking the timeframe involved in feedback loops.
More specifically, it would be interesting to develop plugins that bring AI image synthesis directly into UE5 or Quixel. For example, Stable Diffusion can generate seamless textures from input images and text. A single photo of grass you take outside can be used as an input image and then changed with prompts into a set of seamless textures of blue grass with alien flowers scattered through it. And then you can plug that into other tools to refine the art direction.
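One way to prototype the texture idea today is a community trick: switch the convolutions in Stable Diffusion's UNet and VAE to circular padding so the output tiles seamlessly. The sketch below assumes diffusers and a hypothetical grass_photo.jpg input; it's a hack, not an official feature, and may need adjusting between library versions:

```python
# Community trick for tileable textures: circular padding on every convolution
# in the UNet and VAE so the generated image wraps seamlessly at the edges.
import torch
from torch import nn
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for module in list(pipe.unet.modules()) + list(pipe.vae.modules()):
    if isinstance(module, nn.Conv2d):
        module.padding_mode = "circular"

grass_photo = Image.open("grass_photo.jpg").convert("RGB").resize((512, 512))
texture = pipe(
    prompt="seamless texture, blue alien grass with scattered glowing flowers",
    image=grass_photo,
    strength=0.6,            # allow more deviation to restyle the source photo
    guidance_scale=7.5,
).images[0]
texture.save("alien_grass_tile.png")
```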
AI rendering could be an additional option to consider alongside real-time and DCC rendering engines. One of the main differences would be that the AI render would be driven by what's in the camera view. By using an underlying 3D animation as a scaffold to drive some kind of img2img process, we would mainly be art directing from the 2D projection of the 3D space.
That might open the door to exploring workflows that don't utilize the usual texture + UV Maps approach since we won’t need to exit the camera to make adjustments. I know for a lot of people coming into 3D, understanding the ins and outs of UV coordinates and projections in different 3D programs can be a fairly steep learning curve. We’re not there with the AI tech just yet, but if we can achieve more robust temporal consistency and less artifacting with an AI rendering engine, I can imagine creatives exploring the possibilities that might exist with a camera-driven process.
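To make the camera-driven idea concrete, here's a naive sketch that runs img2img over each rendered frame of a hypothetical ue5_render/ sequence; reusing one seed per frame is a crude stab at temporal consistency and would still flicker with today's models:

```python
# Naive "AI rendering" driven by the camera view: img2img over every rendered
# frame of a 3D animation. Paths, prompt, and settings are illustrative.
import glob
import os
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

os.makedirs("ai_render", exist_ok=True)
generator = torch.Generator("cuda")

for i, path in enumerate(sorted(glob.glob("ue5_render/*.png"))):
    frame = Image.open(path).convert("RGB").resize((512, 512))
    generator.manual_seed(1234)          # same noise pattern for every frame
    out = pipe(
        prompt="hand-painted fantasy environment, muted colors",
        image=frame,
        strength=0.45,
        guidance_scale=7.0,
        generator=generator,
    ).images[0]
    out.save(f"ai_render/{i:04d}.png")
```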
Additionally, if an AI rendering engine is only dependent on the 2D projection, that also means we could potentially apply the same techniques to live or animated footage. It could be a new set of tools to approach VFX and post-production.
All that said, what I'm really excited about is potentially merging the two flows together so we can automate the creation of assets. For example, we could also develop an extension for something like Substance Painter where we can match a 3D view to the reference perspective and simultaneously make painting adjustments to 2D and 3D space.
Another AI could take what’s being painted in the 2D reference frame and project it across geometry and paint whatever is missing into a seamless texture. And set up the UI so that it’s easy for texture artists to go in and further edit as necessary. By going back and forth between AI and more traditional 3D workflows, I think we could use the strengths of both to navigate the weaknesses of both.
Ultimately the goal would be to find ways to help team members work more efficiently so we can hopefully mitigate burnout and spend more time on areas of work that could use more attention. We can put more energy into locking in final aesthetics or polishing UI/UX, testing, managing technical challenges, or marketing/social media/community building.