Gregor Hofer, CEO and Co-Founder of Speech Graphics, spoke about modern-day facial animation and the options available to animators.
Facial animation plays a key role in bringing characters to life in video games and movies through verbal and non-verbal communication. As technology advances, developers have access to an increasingly wide range of animation techniques to help them create more convincing experiences. In this article, we speak to the CEO and Co-Founder of Speech Graphics, Gregor Hofer, to understand the landscape of facial animation and what options are out there for animators.
We delve into the pros and cons of three of the most popular techniques: Audio-Driven Animation, Performance Capture, and Single-Camera Phone Capture, and look at how Speech Graphics has been working with clients to create hybrid solutions that get the best possible results.
Audio-driven facial animation is a technique that uses nothing but audio clips to create realistic facial animation. Most solutions use visemes to represent the key poses in observed speech; Speech Graphics, however, takes a different approach, matching sounds to muscle maps. Using a muscle map increases the accuracy of lip-sync and allows for the manipulation of the soft tissue around the mouth and even the tongue.
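To make the idea concrete, here is a minimal sketch of the general pipeline described above: speech sounds are mapped to muscle activations rather than whole-face viseme poses, and the activations are interpolated into per-frame animation. The phoneme labels, muscle names, and weights are hypothetical placeholders for illustration, not Speech Graphics' actual model.

```python
# Hypothetical phoneme-to-muscle mapping (illustrative values only).
PHONEME_TO_MUSCLES = {
    "AA": {"jaw_open": 0.7, "lip_corner_pull": 0.1},  # open vowel
    "M":  {"lip_press": 0.9, "jaw_open": 0.0},        # bilabial closure
    "F":  {"lip_funnel": 0.2, "lower_lip_raise": 0.8} # labiodental
}

def muscle_curve(phonemes, frames_per_phoneme=3):
    """Expand a phoneme sequence into per-frame muscle activation dicts,
    linearly interpolating between successive phoneme poses."""
    frames = []
    # Pair each phoneme with its successor (last phoneme holds its pose).
    for cur, nxt in zip(phonemes, phonemes[1:] + [phonemes[-1]]):
        a, b = PHONEME_TO_MUSCLES[cur], PHONEME_TO_MUSCLES[nxt]
        muscles = set(a) | set(b)
        for i in range(frames_per_phoneme):
            t = i / frames_per_phoneme
            frames.append({m: (1 - t) * a.get(m, 0.0) + t * b.get(m, 0.0)
                           for m in muscles})
    return frames

curve = muscle_curve(["M", "AA"])
# curve[0] is the pure "M" pose; the lips relax and the jaw opens over time.
```

Because each muscle channel is independent, a solver like this can move the jaw, lips, and tongue separately, which is what gives the muscle-map approach its finer control compared to swapping between fixed viseme poses.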
Speech Graphics has two main products: SGX, which automates high-fidelity facial animation from pre-recorded speech, and SG Com, which animates live speech in real time. A multi-award-winning animation technology house, it has helped countless studios accelerate production and elevate standards across games such as Hogwarts Legacy, The Callisto Protocol, High On Life, and The Last of Us Part 2.
Facial Performance Capture (or PCap) is a highly accurate way to replicate actors' facial expressions. It uses a sound stage and professional equipment and is typically part of VFX production in film, TV, and games.
It tends to be a two-stage process: first the actors' performances are captured, then the data undergoes a post-production "cleanup". It is widely used in Hollywood by studios such as Weta, ILM, Centroid, and Imaginarium, while major game publishers like EA, Rockstar, and Ubisoft also have their own motion capture set-ups. However, this option might not be practical for many developers due to budget constraints.
Single-camera phone capture offers a more accessible approach. Rather than using specialized equipment and crew, it uses a consumer-grade device, such as an iPhone, to record the user's face. It is typically used offline in a production setting, but real-time versions are available as well. Currently, it is popular with streamers and hobbyists and is making some inroads into gaming; better-known tools include MetaHuman Animator, Move AI, and Deepmotion.
While Speech Graphics clients use an audio-driven approach to produce the vast majority of facial animation for their productions, many studios opt for a hybrid approach, blending PCap and audio-driven animation. PCap can capture the likeness of an actor, and audio-driven animation can fill in the intricate details that PCap misses. Furthermore, PCap poses can be imported so that audio drives the same expressions the actor produced in the PCap session.
A common solution is to take the face from a PCap session and use audio-driven animation to animate the tongue, increasing realism and creating a more immersive experience. For example, in Hogwarts Legacy, WB Avalanche produced the upper facial animation from PCap but used Speech Graphics' audio-driven animation to generate the lip sync. They used this approach for all of their gold cut scenes. However, they used audio-driven animation on its own for all their silver and bronze cut scenes, which made up the vast majority of the animations. This approach allowed them to fully localize hundreds of thousands of lines of dialogue across eight languages while maintaining believable emotional expression and dialectically accurate lip-sync.
By reducing the need to animate at the keyframe level by hand, the Avalanche team was able to focus their efforts on fine-tuning the experience to maximize engagement and immersive storytelling.
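The region-split hybrid described above can be sketched as a simple per-frame merge: upper-face channels come from the PCap track, while mouth and tongue channels are taken from the audio-driven solve. The channel names here are hypothetical placeholders, not any engine's actual rig controls.

```python
# Channels assigned to the audio-driven track (illustrative names).
MOUTH_CHANNELS = {"jaw_open", "lip_press", "tongue_up"}

def merge_tracks(pcap_frame, audio_frame):
    """Per frame, keep PCap values except on mouth/tongue channels,
    which are overridden by the audio-driven solve when present."""
    merged = dict(pcap_frame)
    for ch in MOUTH_CHANNELS:
        if ch in audio_frame:
            merged[ch] = audio_frame[ch]
    return merged

frame = merge_tracks({"brow_raise": 0.4, "jaw_open": 0.1},
                     {"jaw_open": 0.6, "tongue_up": 0.3})
# brow_raise comes from PCap; jaw_open and tongue_up from the audio solve.
```

A hard split like this keeps the actor's captured brow and eye performance intact while the lip sync is regenerated, which is the pattern the Hogwarts Legacy gold cut scenes followed.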
A weighted blend of the two systems can also be used, giving the animator a choice between PCap and audio-driven animation on every frame. Although this is a labor-intensive process, it allows the animator to fine-tune every last detail of the face for unparalleled accuracy. This was the approach taken by Naughty Dog in The Last of Us Part 2.
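In contrast to a hard region split, the per-frame weighted blend is a continuous mix: each frame carries a weight that slides the output between the two sources. A minimal sketch, with illustrative channel names (the actual pipelines involved are not public):

```python
def blend_frame(pcap, audio, w):
    """Linearly blend two animation frames channel by channel.
    w = 0.0 gives pure PCap, w = 1.0 gives pure audio-driven;
    the animator can keyframe w over time."""
    channels = set(pcap) | set(audio)
    return {c: (1 - w) * pcap.get(c, 0.0) + w * audio.get(c, 0.0)
            for c in channels}

half = blend_frame({"jaw_open": 0.2}, {"jaw_open": 0.6}, 0.5)
# half["jaw_open"] == 0.4, midway between the two sources
```

Keyframing the blend weight is what makes this approach labor-intensive: the animator is effectively authoring one extra curve on top of both solves, in exchange for frame-level control over which source wins.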
As you can see, each facial animation approach has its pros and cons, and which one animators choose will depend on their budget and priorities. While PCap can provide incredible likeness, it is expensive, less flexible, and runs into difficulties with the mouth. On top of that, it does not offer real-time options and is difficult to scale or localize to different languages. While single-camera phone capture offers affordability and near-real-time performance, its reliance on one camera means it falls short on lip sync accuracy and may not be the best option for gaming studios.
Speech Graphics has developed some of the most advanced audio-driven animation on the market today. It is flexible and scalable, creates precise and realistic lip sync, and can be localized to other languages with ease. Furthermore, it can be combined with PCap to enhance different parts of the face, creating an overall better experience.