Gregor Hofer, CEO and Co-Founder of Speech Graphics, spoke about modern-day facial animation and discussed the options available to animators.
Facial animation plays a key role in bringing characters to life in video games and movies through verbal and non-verbal communication. As technology advances, developers have access to an increasingly wide range of animation techniques to help them create more convincing experiences. In this article, we speak to the CEO and Co-Founder of Speech Graphics, Gregor Hofer, to understand the landscape of facial animation and what options are out there for animators.
We delve into the pros and cons of three of the most popular techniques: Audio-Driven Animation, Performance Capture, and Single-Camera Phone Capture, and look at how Speech Graphics has been working with clients to create hybrid solutions that get the best possible results.
Audio-Driven Facial Animation
Audio-driven facial animation is a technique that uses nothing but audio clips to create realistic facial animation. Most solutions use visemes to represent the key poses in observed speech; Speech Graphics, however, takes a unique approach, using audio-driven technology to match sounds to muscle maps. Using a muscle map increases the accuracy of lip sync and allows for the manipulation of the soft tissue around the mouth and even the tongue.
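To make the conventional viseme approach mentioned above concrete, here is a minimal sketch in Python. The phoneme-to-viseme table and pose names are hypothetical, simplified placeholders, not any particular product's actual mapping:

```python
# Illustrative sketch of the viseme approach: each phoneme in the
# speech maps to one of a small set of mouth key poses (visemes),
# which are then keyframed along the audio timeline.
# This phoneme-to-viseme table is a hypothetical, simplified set.
PHONEME_TO_VISEME = {
    "p": "MBP", "b": "MBP", "m": "MBP",   # lips pressed together
    "f": "FV",  "v": "FV",                # lower lip to upper teeth
    "aa": "AA", "ae": "AA",               # open jaw
    "iy": "EE",                           # spread lips
    "uw": "OO", "ow": "OO",               # rounded lips
}

def visemes_for(phonemes):
    """Map a phoneme sequence to a viseme key-pose sequence,
    collapsing consecutive duplicates (no pose change needed)."""
    poses = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p, "REST")  # default: neutral mouth
        if not poses or poses[-1] != v:
            poses.append(v)
    return poses

print(visemes_for(["m", "aa", "m", "uw"]))  # ['MBP', 'AA', 'MBP', 'OO']
```

Because many phonemes collapse onto the same pose, this style of mapping can only approximate the mouth shapes of real speech, which is the limitation the muscle-map approach is designed to address.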
Speech Graphics has two main products: SGX, which automates high-fidelity facial animation from pre-recorded speech, and SG Com, which animates live speech in real time. A multi-award-winning animation technology house, it has helped countless studios accelerate production and elevate standards across games such as Hogwarts Legacy, The Callisto Protocol, High On Life, and The Last of Us Part II.
- Scalable and cost-effective: relying on only voice files to generate animation makes it easy to scale across animation pipelines and character rigs. It also reduces or eliminates the need for full performance capture, which is subject to a higher cost base than productions that rely more heavily on voice acting.
- Editing flexibility: in SGX Director, characters’ facial expressions can be easily crafted and edited after the first pass. This lets game developers focus on fine-tuning their character animation.
- Precise and realistic lip sync: audio-driven facial animation excels at accurately syncing character lip movements to dialogue, creating realistic expressions. Because Speech Graphics software manipulates the musculature, even the tongue can be animated, providing a level of accuracy not possible with many other animation techniques.
- Real-time and connection to TTS and AI: audio-driven facial animation can be used in real time while maintaining a high degree of accuracy. Characters can also connect to text-to-speech applications or different types of AI to create characters that are dynamic and responsive in real time. As this technology continues to improve, so does the realism of real-time characters.
- Localization into different languages: the scalability of the solution also makes it easy to switch the language across an entire game while keeping the facial animation accurate to each language's sounds.
- Likeness is harder: likeness is more difficult to achieve from just audio because an actor can use multiple combinations of muscles to express the same line. Therefore, motion capture can be better at replicating an actor's specific facial movements.
Facial Performance Capture
Facial Performance Capture (or PCap) is a highly accurate way to replicate actors' facial expressions. It uses a sound stage and professional equipment and is typically part of VFX production in film, TV, and games.
It tends to be a two-stage process: first the actors are captured, then the data goes through a post-production "cleanup". It is widely used in Hollywood by studios such as Weta, ILM, Centroid, and Imaginarium, while major game publishers like EA, Rockstar, and Ubisoft also have their own motion capture set-ups. However, this option might not be practical for many developers due to budget constraints.
- Capturing likeness: Facial performance capture allows for the highly accurate replication of an actor's facial expressions and movements. This level of detail enhances the realism and authenticity of the characters.
- Noisy capture: although PCap is very good at capturing likeness, the data can be noisy due to occlusions and interference, making the cleanup process more laborious.
- Very expensive: PCap requires expensive professional equipment and multiple teams for the different stages of production.
- Limited editability: it is best to use trained actors to get a good performance; however, that leaves you with limited editing capabilities if the performance falls short, or the script changes further down the line.
- Retargeting problem: if the facial model has a different shape from the performing actor’s face, quality degrades significantly because the capture has to be retargeted.
- Difficulties with the mouth: with PCap it can be difficult to capture lip movements because of the soft tissue involved, and it does not capture the movement of the tongue at all.
- No real-time option: with the two-stage process and the large team needed, it is impossible to create real-time animation using facial performance capture.
Single-Camera Phone Capture
Single-camera phone capture offers a more accessible approach. Rather than using specialized equipment and crew, it uses a consumer-grade device, such as an iPhone, to record the user’s face. It typically runs in a production setting, but real-time versions are available as well. Currently it is used by streamers and hobbyists and is making some inroads into gaming; better-known examples include MetaHuman Animator, Move AI, and DeepMotion.
- Affordable: as it only requires a phone it is accessible to almost anyone.
- Near real-time: unlike PCap, it offers nearly real-time animation feedback.
- Lower quality: it can be difficult to get a compelling performance; most people are not actors, so animations tend to be flat.
- Lack of scaling and flexibility: editing capabilities are currently very limited, and the approach does not scale to large amounts of content.
- Lack of accuracy: due to relying on one camera, accuracy is constrained and occlusions are common.
- Retargeting and facial intricacies: like PCap, single-camera phone capture struggles with the retargeting problem and with capturing the soft tissue around the mouth; furthermore, it does not animate the tongue.
While Speech Graphics clients use an audio-driven approach to produce the vast majority of facial animation for their productions, many studios opt for a hybrid approach, using a blend of PCap and audio-driven animation. PCap can capture the likeness of an actor, and audio-driven animation can fill in the intricate details that PCap misses. Furthermore, PCap poses can be imported so that audio drives the same expressions the actor produced in the PCap session.
A common solution is to use the face from a PCap session but use audio-driven animation to animate the tongue, increasing the realism and making a more immersive experience. For example, in Hogwarts Legacy, WB Avalanche produced the upper facial animation from PCap but used Speech Graphics' audio-driven animation to generate the lip sync. They used this approach for all of their gold cut scenes. However, they used audio-driven animation on its own for all their silver and bronze cut scenes, which made up the vast majority of the animations. This approach allowed them to fully localize hundreds of thousands of lines of dialogue across eight languages while maintaining believable emotional expression and lip sync accurate to each language.
By reducing the need to animate at the keyframe level by hand, the Avalanche team was able to focus their efforts on fine-tuning the experience to maximize engagement and immersive storytelling.
A weighted blend between both systems can also be used, giving the animator a choice between PCap and audio-driven animation in every frame. Although this is a labor-intensive process, it allows the animator to fine-tune every last detail of the face for unparalleled accuracy. This was the approach taken by Naughty Dog in The Last of Us Part II.
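The per-frame weighted blend described above can be sketched as a simple linear interpolation between the two sources. The rig control names and the linear-blend formula here are illustrative assumptions, not Speech Graphics' or Naughty Dog's actual pipeline:

```python
# Minimal sketch of a per-frame weighted blend between two animation
# sources. Control names and the linear formula are illustrative
# assumptions for this example.

def blend_frames(pcap_frame, audio_frame, weight):
    """Linearly blend two frames of rig control values.

    weight = 1.0 keeps the PCap performance, weight = 0.0 keeps the
    audio-driven animation, and values in between mix the two. The
    animator can author a different weight on every frame.
    """
    return {
        ctrl: weight * pcap_frame[ctrl] + (1.0 - weight) * audio_frame[ctrl]
        for ctrl in pcap_frame
    }

pcap = {"jaw_open": 0.8, "lip_corner_pull": 0.2}
audio = {"jaw_open": 0.6, "lip_corner_pull": 0.4}

# Favor the audio-driven lip sync while keeping some of the PCap
# performance: e.g. 25% PCap on this frame.
print(blend_frames(pcap, audio, 0.25))
```

In practice the weight could also be authored per control rather than per frame, so that, say, the brows follow PCap while the mouth follows the audio-driven track, mirroring the Hogwarts Legacy split described earlier.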
As you can see, all the facial animation approaches have their pros and cons, and which one animators choose will depend on their budget and priorities. While PCap can provide incredible likeness, it is expensive, less flexible, and runs into difficulties with the mouth. On top of that, it does not offer real-time options and is difficult to scale or localize to different languages. While single-camera phone capture offers affordability and near real-time feedback, its reliance on one camera means it falls short in lip sync accuracy and may not be the best option for gaming studios.
Speech Graphics has developed some of the most advanced audio-driven animation on the market today. It is flexible and scalable, creates precise and realistic lip sync, and can be localized to other languages with ease. Furthermore, it can be used in combination with PCap to enhance different parts of the face, creating an overall better experience.