Speech Graphics: Creating Audio-Driven Facial Animation Technology

Speech Graphics' Gregor Hofer told us about audio-driven facial animation technology and its main limitations, shared how the company's Rapport platform can help brands connect with users, and talked about the company's future plans.

Speech Graphics, the company that delivers audio-driven facial animation technology to the entertainment industry, recently received $7 million in funding to fuel its technology, which enables animated characters in games and applications to move their mouths convincingly in real time. We asked Speech Graphics CEO Gregor Hofer about the algorithms behind the tech and their main limitations, the Rapport platform, and the company's future plans.

Speech Graphics

My co-founder Michael Berger and I met while studying for our PhDs in Speech Technology at the University of Edinburgh. We were both developing audio-driven character animation technology: Michael was working on facial animation and lip-sync from an articulatory and muscle-dynamics perspective, while I focused on machine learning and experimented with different model types. We saw a gap in the market for producing high-fidelity facial animation, so when Speech Graphics was founded in 2010, our initial goal was to enable companies to produce large amounts of it.

The Funding Round

We received the $7 million in funding to expand the company and bring Rapport to market. It allows us to grow, and at the moment we are hiring a Sales and Marketing team. We see a huge growth opportunity in enterprise AI systems for customer experience – the recent "buzz" around the Metaverse and avatars works in our favor. Right now, there is a lot of interest from brands in using new technology to differentiate themselves, and we are at the forefront of high-quality, automated facial animation technology.

Audio-Driven Technology

Facial animation is among the most challenging aspects of computer graphics. Faces are intricate, subtle, and expressive – and speech is one of the most complex human behaviors. Since humans are innately attuned to faces, we are highly sensitive to anything unnatural or unexpected. Meanwhile, faces are frequently in focus, so they account for a large proportion of content. Together, these factors make the demand for automated, high-quality facial animation one of the key drivers behind Speech Graphics. We specialize in audio-driven animation, but we also take cues from other sources such as conversational AI.

Most people approach audio-driven animation from one area of expertise, whereas at Speech Graphics we have long viewed this as a multi-faceted problem requiring expertise in disparate fields such as speech technology, machine learning, linguistics, biodynamics, psychology, and computer graphics. Our goal is to achieve the illusion that the face you see is the source of the sound you hear. To do this, we have spent many years developing a stack of algorithms, and this R&D is always ongoing.

Main Limitations

One limitation is the quality of the facial rig used to visualize the output of our animation system. Our system can work with virtually any character that has parametric controls for the jaw, lips, tongue, and other parts of the face – collectively known as "rigging". During character setup, users establish a mapping from Speech Graphics muscles to the controls of their rig. But if the rig does not produce natural-looking deformations of the facial surface, then the result will be unrealistic or unnatural, no matter how natural and accurate the animation data we feed into the rig.
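To make the setup step concrete, here is a minimal, purely illustrative Python sketch of what such a muscle-to-rig mapping could look like – the muscle names, control names, and weights below are all hypothetical, not Speech Graphics' actual data or API:

```python
# Illustrative only: all muscle names, control names, and weights here are
# hypothetical, not Speech Graphics' actual setup.

# Each muscle channel maps to one or more rig controls with a weight, so the
# same animation data can drive differently built rigs.
MUSCLE_TO_RIG = {
    "jaw_open":     [("jawOpen", 1.0)],
    "lip_pucker":   [("mouthPucker", 0.9)],
    "tongue_raise": [("tongueUp", 1.0)],
    "cheek_raiser": [("mouthSmile_L", 0.5), ("mouthSmile_R", 0.5)],
}

def muscles_to_rig(frame: dict[str, float]) -> dict[str, float]:
    """Convert one frame of muscle activations (0..1) to rig control values."""
    controls: dict[str, float] = {}
    for muscle, activation in frame.items():
        for control, weight in MUSCLE_TO_RIG.get(muscle, []):
            # Accumulate weighted contributions and clamp to the rig's 0..1 range.
            controls[control] = min(1.0, controls.get(control, 0.0) + weight * activation)
    return controls

# Example: a frame mid-vowel with the jaw partly open.
print(muscles_to_rig({"jaw_open": 0.6, "cheek_raiser": 0.3}))
```

The point of the indirection is that the animation side only ever speaks in muscle activations; how convincing those look on screen depends entirely on what the rig controls on the right-hand side do to the facial surface.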

This is becoming less of a problem as the bar for facial animation has risen along with other aspects of computer animation in games, driving more investment in good facial rigging. New character systems are also emerging, such as Epic's MetaHumans – nearly photo-realistic, well-rigged character models that are free to use in the Unreal Engine. This has brought about a sort of democratization of good character rigs across the industry.

Another limitation of such technology is the quality of the training data used for machine learning. Garbage in, garbage out, as the saying goes: without good data, we would not have good models. Speech Graphics has a dedicated Data Engineering team responsible for curating, cleaning, and labeling our ever-growing data collections. We have developed highly specialized proprietary collection and labeling techniques suited to the kinds of muscle simulation and event detection we do. We also publish some of our work, especially in the area of speech emotion recognition.

Rapport

The Rapport platform enables animated digital characters on the web and in apps to foster human connections between end-users and brands. It is built to make it easy for any developer to integrate avatars into any type of experience. Rapport utilizes our core animation technology and connects chatbots and voice services on the backend, enabling developers to use best-in-class technology for their specific vertical. Digital characters foster user engagement – for example, studies show that e-commerce users are more likely to complete a transaction when an avatar is present.
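To illustrate the kind of backend plumbing such a platform abstracts away, here is a hypothetical Python sketch – the endpoints and function names below are placeholders for this article, not Rapport's real API:

```python
# Hypothetical glue code, NOT Rapport's real API: a platform like this wires
# these pieces together for you. All endpoints below are placeholders.
import requests

CHATBOT_URL = "https://example.com/chatbot"    # conversational AI service
TTS_URL = "https://example.com/tts"            # text-to-speech service
ANIMATE_URL = "https://example.com/animate"    # audio-driven animation service

def avatar_reply(user_message: str) -> bytes:
    """Turn a user's message into facial animation data for a spoken reply."""
    # 1. Ask the chatbot for a text reply.
    reply = requests.post(CHATBOT_URL, json={"message": user_message}).json()["reply"]
    # 2. Synthesize speech audio from the reply text.
    audio = requests.post(TTS_URL, json={"text": reply}).content
    # 3. Drive the avatar's face from the synthesized audio.
    return requests.post(ANIMATE_URL, data=audio).content
```

Swapping in a different chatbot or voice vendor only changes one stage of the chain, which is what lets developers pick the best-in-class service for their vertical.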

Emotion Recognition

Our models are trained on large amounts of data spanning different languages, dialects, speakers, styles, and emotions to create a comprehensive, universal mapping from acoustics to movement. Without giving too much away, we do look at different physical states of the user, as well as broad emotional states, and use a range of algorithms to determine the best muscle configuration at any given point in the signal. Our operating assumption is that all humans have the same vocal apparatus and biological makeup, regardless of language or culture. Our goal is to minimize the need for any manual intervention in the animation process, so these nonverbal aspects of the performance are critical to our customers.
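In the abstract, such an acoustics-to-movement mapping can be thought of as frame-level regression from audio features to muscle activations. The sketch below is a minimal stand-in under assumed feature and muscle dimensions – it is not Speech Graphics' model:

```python
# Minimal stand-in for an acoustics-to-movement mapping, NOT Speech Graphics'
# model. Feature sizes, hop length, and the muscle vector are assumptions.
import numpy as np

N_FEATURES = 26   # e.g. filterbank energies per 10 ms audio frame (assumed)
N_MUSCLES = 40    # size of the muscle activation vector (assumed)

# Stand-in for trained weights: a single linear layer with a sigmoid output.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(N_FEATURES, N_MUSCLES))
b = np.zeros(N_MUSCLES)

def predict_muscles(features: np.ndarray) -> np.ndarray:
    """Map (frames, N_FEATURES) acoustic features to (frames, N_MUSCLES)
    muscle activations in [0, 1] – one muscle configuration per frame."""
    return 1.0 / (1.0 + np.exp(-(features @ W + b)))

# Example: one second of audio at a 10 ms hop = 100 feature frames.
track = predict_muscles(rng.normal(size=(100, N_FEATURES)))
print(track.shape)  # (100, 40)
```

Because the mapping is learned per audio frame rather than per language, the same model can in principle serve any speaker – which is the point of the "same vocal apparatus" assumption above.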

Roadmap

We have a number of exciting projects in the works for 2022. Right now, our focus is on tighter integration of SGX within the Unreal Engine, as well as on interactive tools that give users more creative control over the output. We are also working hard to bring the Rapport platform to market – we are currently in closed alpha and aim to show more later in the year.

Gregor Hofer, CEO and co-founder of Speech Graphics

Interview conducted by Ana Kessler
