Understanding Real-Time Facial Animation

Dimitrios Kyteas, the Head of Visualization Engineering at Speech Graphics, has offered an in-depth look at the company's SG Com, discussed the challenges of developing this real-time animation tool, and explained how the solution can be integrated into game engines.

Introduction

My name is Dimitrios Kyteas, and I'm the Head of Visualization Engineering at Speech Graphics. My usual responsibilities include:

Developing software tools used by our Creative Team and customers for both offline and real-time animation.
Integrating SG Com with various systems (e.g. Unreal Engine, Unity3D, WebGL, and audio systems).
Supporting customers with integrating our tech into their pipelines.
Developing plugins and samples that facilitate integrations.
Developing demo apps that showcase our real-time tech.

Speech Graphics & SG Com

SG Com is a solution designed to enhance communication and expression in virtual characters and avatars. The technology focuses on real-time facial animation and lip-syncing, enabling virtual characters to accurately synchronize their speech movements with the audio input.

It uses advanced algorithms to analyze the audio and generate corresponding facial animations in real time. This allows for more realistic and natural communication between virtual characters and users in applications such as video games, virtual reality (VR), augmented reality (AR), and animated movies.

The Story & Development of SG Com

The foundation of SG Com is our tried and tested SGX technology. SGX is an award-winning automatic speech-to-facial animation system, built on over 20 years of R&D in speech technology, linguistics, machine learning, and procedural facial dynamics. It is already used by 90% of AAA publishers and features in games such as Hogwarts Legacy, The Last of Us Part 2, and Resident Evil: Village.

The first challenge in developing SG Com was to approximate the processing pipelines of SGX with a stream-based architecture. This included generating both lip sync and non-verbal behaviors in real time. We had to make all of our algorithms low-latency to be able to operate in a communication setting. Speech is highly contextual, and it can be very difficult to interpret the current sound without capturing some buffer of what follows. Our 50-millisecond latency was a hard-won achievement. This low latency also means the character must react quickly to sudden, unanticipated changes in the voice, so it can’t adhere too rigidly to some planned behavior.
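The trade-off between context and latency can be illustrated with a minimal sketch. This is not Speech Graphics' implementation: the class name, the 10 ms frame size, and the averaging "analysis" are all assumptions chosen for illustration. It only shows that a stream processor which waits for a fixed number of future frames before emitting a result has a constant delay equal to that lookahead.

```python
from collections import deque

FRAME_MS = 10          # assumed frame duration, for illustration only
LOOKAHEAD_FRAMES = 5   # 5 frames x 10 ms = 50 ms of future context


class LookaheadStream:
    """Emits a result for a frame only once `lookahead` later frames have arrived."""

    def __init__(self, lookahead=LOOKAHEAD_FRAMES):
        self.lookahead = lookahead
        self.window = deque()

    def push(self, frame):
        """Add one frame of audio features; return an output or None while buffering."""
        self.window.append(frame)
        if len(self.window) <= self.lookahead:
            return None  # not enough future context yet
        current = self.window.popleft()
        future = list(self.window)
        return self.process(current, future)

    def process(self, current, future):
        # Placeholder analysis: average the current frame with its future context.
        values = [current] + future
        return sum(values) / len(values)


def latency_ms(lookahead=LOOKAHEAD_FRAMES, frame_ms=FRAME_MS):
    """Constant output delay introduced by waiting for the lookahead frames."""
    return lookahead * frame_ms
```

With a lookahead of two frames, the first two calls to `push` return `None` and every later call returns the result for the frame received two calls earlier. A larger lookahead gives the analysis more context at the cost of a proportionally larger delay.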

In addition, we had to make our software highly efficient to be able to run in-game, even driving many characters at once. And yet with all of these optimizations, we also wanted to avoid sacrificing too much animation quality!

By contrast, SGX is a file-based processing system running offline. Speed and efficiency are important, but it is ultimately a desktop tool that gets a lot of leeway. We now work on both SGX and SG Com simultaneously, keeping them synced in terms of features, but obviously under these different constraints.

Another challenge with SG Com is dealing with an array of recording conditions. Unlike SGX, which typically gets clean, studio-recorded audio files, SG Com may be used in real-world situations with a live mic. This means our algorithms must be noise-robust and able to distinguish between a human voice and other sounds in the background, including an infinite variety of environmental sounds, noise, music, and even other voices!

Other challenges relate to the pipeline infrastructure in which SG Com is going to be used. For example, our system requires uncompressed audio, while some third-party audio components deliver only compressed audio in proprietary formats, so we must provide solutions to get the raw audio out. Syncing audio and animation playback can also be challenging when a network is involved: the offset between audio input and animation output is constant, but network transport can break that alignment if audio and animation are transmitted on independent channels.

SG Com provides its own Player component in which received data packets can be buffered and resampled on the local playback timeline.
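The idea of buffering packets and resampling them on a local timeline can be sketched as follows. This is an illustrative assumption, not the SG Com Player: the class name, the packet layout (a timestamp plus a list of control values), and the choice of linear interpolation are all hypothetical.

```python
import bisect


class AnimationBuffer:
    """Buffers timestamped animation frames and samples them at local playback time."""

    def __init__(self):
        self.times = []   # sorted timestamps in seconds
        self.frames = []  # one list of control values per timestamp

    def push(self, timestamp, values):
        """Insert a frame, keeping order even if packets arrive out of order."""
        i = bisect.bisect_left(self.times, timestamp)
        if i < len(self.times) and self.times[i] == timestamp:
            self.frames[i] = list(values)  # duplicate packet: overwrite
            return
        self.times.insert(i, timestamp)
        self.frames.insert(i, list(values))

    def sample(self, t):
        """Return values at time t, linearly interpolated between neighbouring frames."""
        if not self.times:
            return None
        if t <= self.times[0]:
            return list(self.frames[0])
        if t >= self.times[-1]:
            return list(self.frames[-1])  # hold the last frame if data is late
        hi = bisect.bisect_right(self.times, t)
        lo = hi - 1
        t0, t1 = self.times[lo], self.times[hi]
        w = (t - t0) / (t1 - t0)
        return [a + (b - a) * w for a, b in zip(self.frames[lo], self.frames[hi])]
```

A renderer would call `sample` once per displayed frame with the current audio playback time, so the animation follows the audio clock regardless of when each packet actually arrived.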

Use Cases

SG Com can be used in any scenario that involves 3D facial animation. Gaming is an obvious use case, but its true utility becomes apparent in applications that involve avatar communication: chat apps (including VR), customer support applications, chatbot integrations, and pretty much anything that requires 3D facial animation and dynamic audio (microphone input, text-to-speech, frequently updated audio files).

SG Com is performant and versatile enough to be used on multiple platforms (phones, consoles, PC), including browsers. In the video below you can see SG Com being used in Fortnite.

SG Com's Benefits & Advantages

SG Com can be used in real-time applications (e.g. chat apps or multiplayer games), but also in applications that would benefit from runtime generation of animation.

Studios making games with large amounts of dialogue also face the problem of managing huge amounts of facial animation data. All that data takes up a lot of space and download bandwidth, especially in multiple languages, not to mention the time and effort involved in processing and managing it all during game production. By generating the animation at runtime, there’s no animation to process, store, or download. New dialogue can be added by streaming audio only.

It also makes localization of facial animation dead easy, since SG Com works for any language and even works for non-speech sounds (e.g. grunts and laughs). SG Com can also be used to generate listening behavior based on audio input. This opens up use cases like multiple characters reacting differently to one speaker (e.g. some could be afraid, others enraged).

Integrating the Solution Into Game Engines

For Unreal Engine and Unity, we provide sample projects that make it possible to integrate SG Com without dedicating engineering resources. For Unreal Engine, we also provide a plugin that exposes the functionality to the Blueprint layer, the Editor's visual programming utility, which makes integration even easier.

Customers who do not use Unity or Unreal Engine need to integrate SG Com into their pipeline themselves. SG Com is delivered as a C API (Application Programming Interface) and a sample C++ program that demonstrates a basic use case.

The API contains functions that operate on two main concepts: the Engine and the Player. The Engine functions enable the user to input audio data to SG Com, trigger the animation data generation, and configure settings like behavior modes (sad, happy, etc.), automatic behavior mode detection, the intensity of expressions, generation of idle animation (animation when no audio is provided), etc.

The Player functions are used to retrieve the final animation data that can then be applied to the 3D model at runtime. While the API is fairly straightforward, it does require someone with C/C++ programming experience and knowledge of the game engine to use it successfully. Customers are usually able to integrate SG Com in a matter of days, and most issues can be resolved through email or a call.
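The division of labor between the Engine and the Player can be pictured with a toy sketch. None of the names below come from the SG Com C API: `ToyEngine`, `ToyPlayer`, their methods, and the single `jaw_open` control are hypothetical. The "animation" is just the loudness of each audio block, standing in for the real analysis. The sketch only shows the data flow: audio goes into an engine, generation is triggered, and a player hands frames to the renderer.

```python
import math


class ToyEngine:
    """Stand-in for an audio-to-animation engine; NOT the SG Com API."""

    def __init__(self, intensity=1.0, idle_value=0.05):
        self.intensity = intensity    # scales expression strength
        self.idle_value = idle_value  # output used when no audio is pending
        self._pending = []

    def input_audio(self, samples):
        """Queue one block of uncompressed PCM samples in the range [-1, 1]."""
        self._pending.append(list(samples))

    def process(self):
        """Turn queued audio blocks into animation frames, one frame per block."""
        if not self._pending:
            return [{"jaw_open": self.idle_value}]  # idle animation
        frames = []
        for block in self._pending:
            rms = math.sqrt(sum(s * s for s in block) / len(block)) if block else 0.0
            frames.append({"jaw_open": min(1.0, rms * self.intensity)})
        self._pending.clear()
        return frames


class ToyPlayer:
    """Holds generated frames until the renderer asks for them."""

    def __init__(self):
        self._frames = []

    def receive(self, frames):
        self._frames.extend(frames)

    def next_frame(self):
        """Return the oldest unplayed frame, or None if the queue is empty."""
        return self._frames.pop(0) if self._frames else None
```

In a game loop, the audio callback would feed `input_audio`, an update step would call `process` and pass the result to the player, and the render step would apply each frame from `next_frame` to the character rig.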

Thoughts on Generative AI

The rise of large language models has revolutionized our customers' ability to generate meaningful content from conversational AI (chatbots), and in conjunction with synthetic voice, that new content directly enhances a variety of SG Com applications. Speech Graphics also has a deep research arm focusing on the potential of the visual aspects of generative AI, including powerful diffusion models in animation and character creation. Stay tuned...

Dimitrios Kyteas, Lead Software Engineer at Speech Graphics

Interview conducted by Arti Burton
