Michael Berger, Co-Founder and CTO of Speech Graphics, told us about the company's history, explained how their facial animation solutions SGX and SG Com work, and discussed why AAA game developers use them in their projects.
Introduction
Hi, I'm Michael Berger, Co-Founder and CTO of Speech Graphics. Our team brings together specialists in linguistics, machine learning, software engineering, and technical art to provide solutions (SGX and SG Com) that accelerate the production of high-quality facial animation from voice audio.
Our technology has been used to deliver accurate lip sync and believable facial expressions in games like The Last of Us Part II, Hogwarts Legacy, High On Life, Resident Evil Village, Fortnite, Blackwood Crossing, Gotham Knights, and Dying Light 2 Stay Human, as well as the two most recent entries in the Gears of War series.
The History of Speech Graphics
While studying linguistics in the mid-90s, I started working in a new field called 'visual speech synthesis', which applied models of speech articulation to computer graphics. This had both the scientific value of putting models to the test and the practical value of automating facial animation. I saw fundamental problems with the existing simplistic models and set out to do things differently. I made my first prototype end-to-end system in 2001. This led to some of the core ideas that Speech Graphics was later founded on.
I met my Co-Founder Gregor Hofer (CEO of Speech Graphics) at the University of Edinburgh in 2007, where we were both doing PhDs in Informatics at the Centre for Speech Technology Research. I was continuing my work there on visual speech and Gregor was working on a related area of nonverbal animation driven by speech. We both saw the great impact facial animation technology could have on the world.
In particular, we saw that this technology would solve a huge problem for the video game industry. Modern games can contain vast amounts of recorded voiceover, possibly repeated in many languages for localization. Animating all of this dialogue either by hand or through motion capture is very expensive and time-consuming. Audio-driven facial animation provides a fast, low-cost alternative. Other audio-driven software already existed but we believed that our approach was far superior in terms of quality. On that basis, we founded Speech Graphics Ltd as a university spinout in November 2010.
We worked very closely with the industry from the beginning, and their feedback was one of the main driving factors behind our ongoing R&D. Video game developers are some of the hardest customers to please because they are in a constant arms race over realism. Our mission was to fully automate high-quality animation, including lip sync, emotional and other non-verbal facial expressions, head motion, blinks, etc. We advanced our analyses to extract more and more information from the audio signal, and advanced our models to produce more naturalistic movements, to the point where we were really helping our customers.
Michael Berger and Gregor Hofer in the earlier days of Speech Graphics
The SGX Suite
SGX is our main production suite. It's used by developers looking to produce high-quality facial animation content. It essentially converts audio files into animation files, which can then be imported into game engines or other 3D animation systems.
It works for any language, including fictional ones. Optional transcripts, supported in 11 different languages, can be supplied for higher-quality output. It also works for any character after a one-time setup of a character control file, in which the behaviors and idiosyncrasies of the character can be fine-tuned.
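To make that workflow concrete, here's a minimal sketch of the audio-to-animation mapping. The types and the GenerateFacialAnimation function are hypothetical stand-ins invented for illustration, not the actual SGX API:

```cpp
#include <filesystem>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical types, invented for illustration (not the actual SGX API).
struct CharacterConfig { std::string path; };  // one-time character control file
struct Transcript      { std::string text; };  // optional, supports higher-quality output
struct AnimationClip   { /* animation curves for the character's facial rig */ };

// Conceptually, batch processing maps (audio [+ optional transcript], character) to animation.
AnimationClip GenerateFacialAnimation(const std::filesystem::path& audioFile,
                                      const std::optional<Transcript>& transcript,
                                      const CharacterConfig& character) {
    // Stand-in for the real audio analysis and animation synthesis.
    (void)audioFile; (void)transcript; (void)character;
    return {};
}

int main() {
    CharacterConfig hero{"hero_character.config"};  // set up once per character

    // Convert every voiceover line in a folder into an animation clip.
    for (const auto& entry : std::filesystem::directory_iterator("voiceover")) {
        if (entry.path().extension() != ".wav") continue;
        AnimationClip clip = GenerateFacialAnimation(entry.path(), std::nullopt, hero);
        // ...export the clip in a format the game engine or DCC tool can import.
        std::cout << "Processed " << entry.path() << "\n";
    }
}
```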
SGX 4 includes three modules:
- SGX Producer, the command-line tool for batch processing.
- SGX Director, a new desktop tool for interactive processing.
- SGX Studio, a growing suite of plug-ins for Maya, Unreal Engine, and soon other platforms, used for character setup, animation import, and visualization.
The main focus of SGX development over the years has been to optimize the automatic quality you get out of batch processing. But we've also invested a lot of R&D into giving users more creative control through interactive processing. This expanded functionality is available in the SGX 4 Director tool, which is coming out of a year-long beta at GDC 2023.
SGX Director gives users direct control over animation performances by exposing key metadata created during SGX processing. By editing the metadata on an interactive timeline, users see immediate changes in the animation. For example, you can move word boundaries, change behavior modes, swap facial expressions, or tweak modifiers such as intensity, speed, and hyperarticulation of the muscles. Because the metadata is high-level, the editing process is fast and intuitive. You might forget it's an audio-driven solution as it starts to feel like a point-and-click creative tool.
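To give a flavor of what that high-level metadata might look like, here's an illustrative sketch; the field names and structure are my own invention, not the actual SGX Director format:

```cpp
#include <string>
#include <vector>

// A hypothetical, simplified view of performance metadata (not the real SGX format).
struct WordEvent {
    std::string word;
    double startSec = 0.0;  // word boundaries can be nudged on the timeline
    double endSec   = 0.0;
};

struct BehaviorSegment {
    std::string mode;        // named behavior mode active for this stretch of audio
    std::string expression;  // facial expression drawn from that mode's set
    double intensity         = 1.0;  // modifier: overall strength of the movement
    double speed             = 1.0;  // modifier: how quickly the face transitions
    double hyperarticulation = 1.0;  // modifier: exaggeration of articulator movement
};

struct PerformanceMetadata {
    std::vector<WordEvent> words;
    std::vector<BehaviorSegment> segments;
};

// Editing a field and letting the timeline re-render is the whole workflow, e.g.:
//   metadata.segments[1].expression = "amused";   // swap the expression in segment 2
//   metadata.words[0].endSec += 0.05;             // nudge a word boundary by 50 ms
```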
SG Com
SG Com is our real-time, runtime solution. It converts an incoming stream of audio into facial animation with a latency of only 50 milliseconds. This has many uses, the most obvious of which is real-time chat through avatars. The best-known example of this has probably been the player-to-player chat function in the Fortnite lobby. From the mic input, players could “puppeteer” their characters via SG Com. The player would speak, and their character’s face would animate accordingly in real time.
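The integration pattern is easiest to picture as a push/pull loop: audio goes in from the mic callback, animation frames come out on the game tick. The FaceStream class and its methods below are invented for illustration, not the real SG Com API, but they sketch the shape of such a runtime:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper around a streaming audio-to-animation engine.
// The class and method names are invented for illustration, not the real SG Com API.
class FaceStream {
public:
    // Feed the next chunk of microphone samples (e.g. 16-bit PCM).
    void PushAudio(const std::int16_t* samples, std::size_t count) {
        pending_.insert(pending_.end(), samples, samples + count);
        // A real engine would analyze this audio and queue animation roughly 50 ms later.
    }

    // Pull whatever animation frame is ready (it trails the audio by the engine's latency).
    std::vector<float> PullAnimationFrame() {
        // Stand-in: a real engine would return rig control values derived from the audio.
        return {};
    }

private:
    std::vector<std::int16_t> pending_;
};

// Typical integration: the mic callback pushes audio in, the game tick pulls animation out.
void OnMicCallback(FaceStream& stream, const std::int16_t* samples, std::size_t count) {
    stream.PushAudio(samples, count);
}

void OnGameTick(FaceStream& stream /*, FacialRig& rig */) {
    std::vector<float> frame = stream.PullAnimationFrame();
    if (!frame.empty()) {
        // Apply the control values to the character's facial rig for this tick.
    }
}
```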
Another typical use case is concerned less with the real-time aspect of SG Com and more with generating animation at runtime. Studios making games with large amounts of dialogue also face the problem of managing huge amounts of facial animation data. All that data takes up a lot of space and download bandwidth, especially in multiple languages, not to mention the time and effort involved in processing and managing it all during game production. So why not generate the animation on the fly from the audio alone while the game is running? That way, there's no animation to process, store, or download. New dialogue can be added by streaming audio only. It also makes localization of facial animation dead easy, since SG Com works for any language.
Another use case is generating listening behavior for characters in the vicinity of speaking characters. Our new "Speaker/Listener" demo at GDC will give visitors the opportunity to talk into a microphone to drive the speaking character, while a second character reacts to what they’re saying. This shows how the voice qualities our system analyzes in real time can be used differently depending on the role of the character. For example, if Character A is being aggressive, Character B might look scared and intimidated. However, Character C might be unafraid of Character A and instead show angry expressions.
With this demo, we're taking things a step further by showing how body animations can be triggered by SG Com too.
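Conceptually, the listener side of that demo boils down to mapping the speaker's analyzed voice qualities through the listening character's own disposition. The sketch below is my own simplified guess at that idea; VoiceState, Disposition, and ListenerReaction are invented names, not actual SG Com output:

```cpp
#include <string>

// Hypothetical voice-quality summary from real-time analysis.
// Field and function names are invented for illustration, not SG Com output.
struct VoiceState {
    float arousal = 0.0f;  // how energetic or agitated the speaker sounds
    float valence = 0.0f;  // positive versus negative tone
};

enum class Disposition { Timid, Defiant };

// The same analyzed voice drives different reactions depending on the listener's role.
std::string ListenerReaction(const VoiceState& speaker, Disposition listener) {
    const bool aggressive = speaker.arousal > 0.7f && speaker.valence < 0.0f;
    if (!aggressive) {
        return "neutral_attentive";
    }
    return listener == Disposition::Timid ? "scared_intimidated" : "angry_defiant";
}
```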
Speech Graphics will be demonstrating both its production and real-time solutions live at GDC.
Technical Challenges
The first challenge in developing SG Com was to approximate the processing pipelines of SGX with a stream-based architecture. This included generating both lip sync and non-verbal behaviors in real time. We had to make all of our algorithms low-latency to be able to operate in a communication setting. Speech is highly contextual, and it can be very difficult to interpret the current sound without buffering some of what follows. Our 50-millisecond latency was a hard-won achievement. Such low latency also means there can be sudden, unanticipated changes in the voice that the character must react to quickly, so it can't adhere too rigidly to some planned behavior. In addition, we had to make our software highly efficient to be able to run in-game, even driving many characters at once. And with all of these optimizations, we also wanted to avoid sacrificing too much animation quality!
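For a rough sense of scale, here's a back-of-the-envelope look at what a 50-millisecond budget buys; the sample rate and animation frame rate are my assumptions, not SG Com specifics:

```cpp
// Back-of-the-envelope arithmetic for a 50 ms budget.
// The 16 kHz sample rate and 30 fps animation rate are assumptions for illustration.
constexpr double kLatencySec       = 0.050;
constexpr int    kMicSampleRate    = 16000;
constexpr int    kLookaheadSamples = static_cast<int>(kLatencySec * kMicSampleRate);  // 800 samples
constexpr double kAnimFps          = 30.0;
constexpr double kLookaheadFrames  = kLatencySec * kAnimFps;  // only ~1.5 animation frames of context
```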
By contrast, SGX is a file-based processing system running offline. Speed and efficiency still matter, but it is ultimately a desktop tool and gets a lot more leeway. We now work on both SGX and SG Com simultaneously, keeping them synced in terms of features, but obviously under these different constraints.
Another challenge with SG Com is dealing with an array of recording conditions. Unlike SGX, which typically gets clean, studio-recorded audio files, SG Com may be used in real-world situations with a live mic. This means our algorithms must be noise-robust and be able to distinguish between a human voice and other sounds in the background, including an infinite variety of environmental sounds, noise, music, and even other voices!
Other challenges relate to the pipeline infrastructure in which SG Com is going to be used. For example, our system requires uncompressed audio, while some third-party audio components deliver only compressed audio in proprietary formats, so we must provide solutions to get the raw audio out. Syncing audio and animation playback can also be challenging when a network is involved: the offset between audio input and animation output is constant, but network transport can break synchronization if audio and animation are transmitted on independent channels. For this, SG Com provides its own Player component, in which received data packets can be buffered and resampled on the local playback timeline.
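The general idea behind such a player is a small jitter buffer: queue the received animation frames and sample them against the local playback clock so they stay locked to the locally played audio. The sketch below illustrates that idea only; it is not the actual SG Com Player component:

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Sketch of a playback-side buffer: queue received animation frames, then sample them
// against the local playback clock so they stay locked to the locally played audio.
struct AnimFrame {
    double timestampSec = 0.0;    // sender-side timestamp of this frame
    std::vector<float> controls;  // rig control values
};

class PlaybackBuffer {
public:
    void OnPacketReceived(AnimFrame frame) { frames_.push_back(std::move(frame)); }

    // Return the control values for the given local playback time, interpolating
    // between the two nearest buffered frames (assumes equal control counts).
    std::vector<float> Sample(double playbackTimeSec) {
        // Drop frames already behind the playback point, keeping one for interpolation.
        while (frames_.size() > 1 && frames_[1].timestampSec <= playbackTimeSec) {
            frames_.pop_front();
        }
        if (frames_.empty()) return {};
        if (frames_.size() == 1 || playbackTimeSec <= frames_.front().timestampSec) {
            return frames_.front().controls;
        }
        const AnimFrame& a = frames_[0];
        const AnimFrame& b = frames_[1];
        const double t = (playbackTimeSec - a.timestampSec) / (b.timestampSec - a.timestampSec);
        std::vector<float> out(a.controls.size());
        for (std::size_t i = 0; i < out.size(); ++i) {
            out[i] = static_cast<float>(a.controls[i] + t * (b.controls[i] - a.controls[i]));
        }
        return out;
    }

private:
    std::deque<AnimFrame> frames_;  // frames ordered by timestamp
};
```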
SGX is distributed on Windows both standalone and with plugins for Maya and Unreal Engine. SG Com comes as an engine-agnostic C SDK as well as an Unreal Engine plugin. We also offer sample projects for Unreal and Unity as part of the evaluation process.
The Team
The company has nearly 70 people now. We're a highly specialized bunch. We have a strong science program as well as engineering, and an art team that works closely with customers.
Our team includes linguists, data engineers, speech technologists, machine learning scientists, various kinds of software engineers from game engine programmers to cloud engineers, technical artists, character artists, animators, and people who try to figure out obscure things about human behavior.
For us, audio-driven facial animation is deeply interdisciplinary. Most people treat this problem from the perspective of one discipline or another, as an animation problem or a speech technology problem. We're the only people crazy enough to pursue it from all avenues at once, and that's why we're successful. We're also very collaborative and collegial, which makes us all smarter because we learn from each other.
In terms of geography, we have four offices now, with headquarters in Edinburgh, Scotland, and satellite offices in the US, Hungary, and Singapore. We count at least 20 different nationalities on the staff!
Success Stories
This sector is pretty secretive, which makes things difficult to talk about sometimes. There are some games coming out later this year that use SG Com in very interesting ways. I hope we’ll be able to talk about that at some point in the future. In terms of SGX, the recent launches of High On Life and Hogwarts Legacy are special for different reasons.
High On Life addresses one of the most common questions we’re asked: is our technology compatible with different art styles? The answer is a resounding yes. Our technology is truly agnostic about art styles. As long as the rig permits movements roughly analogous to human faces, we can animate it. Many of our demos use MetaHumans because the ability to achieve realism is something our technology facilitates. However, it’s clear that people want to see proof that we can drive more cartoony characters as well. (We'll be driving some anime characters at GDC to help make this point.)
Hogwarts Legacy, on the other hand, is a triumph in the way SGX was used to produce facial animation to the same high standard in 8 different languages, to the point that people were talking about it on Twitter.
Future Plans
We're continuously working on improving animation quality for both SGX and SG Com. This is a task that never ends.
Part of this is lip sync. We've just released new and improved Japanese 2.0 and Mandarin 2.0 modules for SGX, and Korean 2.0 is next in the refinement rotation from our linguistics team. We're also working on general improvements to lip sync across all languages through advances in model architecture and training.
Another part of improving animation quality concerns better interpretation and animation of emotional content in the voice as well as non-speech vocalizations like grunts, breaths, laughs, etc. Our long-term goal is to correctly animate any sound that can come out of the human vocal tract, whether it's intelligible language or a snarl. We're continuing to work hard in this area. You can expect some significant new "auto modes" in both SGX and SG Com – meaning states that are automatically detected in the voice and trigger changes in behavior.
We're planning to further develop our capabilities with body animation this year. Hand gestures and movements of the head and body are highly correlated with speech, and we're working to provide better automation of that. Also, metadata from SGX and SG Com is going to be exploited in new ways to trigger body animations in-engine.
We'll also be making improvements to the character setup process. Character setup is very important because it determines the whole behavior of the character, so it's the main lever an artist can use to have the broadest impact. We will work on removing the need to ever have more than one control file per character, to avoid duplicating work. The behavior modes of the character are also going to be enriched with various new features. Currently, a behavior mode contains only sets of facial expressions, but this year we'll be adding more parameters so that you can use behavior modes as constellations of behavioral properties, controlling everything about the character that you would previously have controlled with separate modifiers.
Finally, we're planning a lot of work on tools and plugins. This includes more automation, improving UX, and expanding SGX and SG Com plugins for engines. Essentially, everything we do is designed to accelerate the delivery of high-quality animation, so that our clients can focus on bringing their creative vision to life as effectively and efficiently as possible.
There is some information on our website and we're working on a new Speech Graphics Knowledge Base for our customers. But one of the things we're most proud of is that, as we've grown, we've kept ourselves close to clients and stayed committed to delivering the best customer experience. It's from interactions with customers that we get direct feedback and decide what to prioritize in our development. So we still encourage our clients and prospects to talk to us directly whenever possible. It’s often faster and more conclusive.
After all, communication is at the core of what we do!