Leveraging Voice Analysis for Immersive Interactions with Speech Graphics' Metadata

The Speech Graphics team provided an overview of how its audio-driven facial animation software, SGX, generates metadata sequences that breathe life into characters and NPCs.

Intro

Speech Graphics' audio-driven facial animation software, SGX, not only produces animation but also exposes a rich set of metadata, including emotional cues and other information extracted from speech. The SGX Director module, released last year, lets users edit metadata on the timeline to refine animation output at a high level without editing keyframes. In the latest release, game developers can also export metadata for their own use outside of SGX. This opens up character interactions in games, enabling developers to imbue characters and game logic with context-awareness drawn directly from the voice performance.

The metadata captures the rhythm and sound patterns of speech, analyzing elements such as intensity and pitch to discern underlying emotions. These analyses can then drive in-game behavior, making characters react in a manner that truly reflects the emotional context of their dialogue.

The team at Avalanche Software recently leveraged Speech Graphics' metadata while working on Warner Bros. Games' Hogwarts Legacy. Using the metadata at runtime allowed them to harness pitch, intensity, and prosody, along with word and pause alignments extracted from Event files. This data enabled them to automate body gestures matched to what a character was saying and to transition in and out of speech while maintaining facial emotion, elevating character performance quality across languages. For a deeper dive into what they did, check out the case study.
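
The export schema is not published in this article, so the following is only a rough sketch of the runtime idea, assuming a JSON export with hypothetical "words" and "intensity" fields: loud words trigger emphasis beats, and long gaps between words trigger idle gestures.

import json

# All field names below ("words", "intensity", "start", "end") are
# assumptions about a JSON export, not the actual SGX schema.
def load_metadata(path):
    with open(path) as f:
        return json.load(f)

def gesture_cues(metadata, intensity_threshold=0.7, pause_gap=0.35):
    """Return (time, cue) events: an emphasis beat on loud words and
    an idle gesture in long pauses between words."""
    samples = metadata["intensity"]   # assumed [[time, value], ...]
    words = metadata["words"]         # assumed [{"text", "start", "end"}, ...]

    def intensity_at(t):
        # Nearest-sample lookup; production code would interpolate.
        return min(samples, key=lambda s: abs(s[0] - t))[1]

    cues = [(w["start"], "emphasis_beat")
            for w in words
            if intensity_at(w["start"]) >= intensity_threshold]
    cues += [(prev["end"], "idle_gesture")
             for prev, nxt in zip(words, words[1:])
             if nxt["start"] - prev["end"] >= pause_gap]
    return sorted(cues)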

See Emotion, Word Bucket, Pause, Tone, and Intensity metadata firing in real time, with onscreen logging.

Breathing Life Into NPCs

This capability not only enriches the gaming narrative but also has major implications for NPCs, allowing studios to infuse game worlds with dynamic, responsive characters. Envision NPCs that grasp not just the words spoken by your character but also the emotions driving those words. The result is more genuine interactions that significantly deepen the player's immersion. This is where the metadata feature shines, going beyond traditional animation enhancement to pioneer emotionally intelligent, lifelike game experiences.

This groundbreaking feature exemplifies Speech Graphics' dedication to expanding the horizons of animation and game design. It opens up a new chapter in interactive storytelling, promising a future where game worlds are more alive and emotionally engaging than ever before.

Extending the Experience with Technical Insights

Here is a sample of the metadata sequences developers have access to, both in SGX Director for editing output and as files exported for use outside of SGX.

Word alignment: Analysis of the timing of words in the audio. In SGX Director, word alignment boundaries may be shifted. In metadata export, word alignment may be used to generate accurately timed subtitles.
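
As a minimal sketch of the subtitle use case, assuming word alignment exports as a list of entries with hypothetical "text", "start", and "end" fields (times in seconds), the words can be grouped into SRT blocks:

def to_srt(words, max_chars=42):
    """Turn word alignments into SRT subtitle text. Each word is
    assumed to be {"text": str, "start": float, "end": float} with
    times in seconds (a hypothetical export schema)."""
    def stamp(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    # Greedily group words into lines of roughly max_chars characters.
    blocks, line = [], []
    for word in words:
        line.append(word)
        if sum(len(w["text"]) + 1 for w in line) > max_chars:
            blocks.append(line)
            line = []
    if line:
        blocks.append(line)

    entries = []
    for i, block in enumerate(blocks, 1):
        text = " ".join(w["text"] for w in block)
        span = f"{stamp(block[0]['start'])} --> {stamp(block[-1]['end'])}"
        entries.append(f"{i}\n{span}\n{text}\n")
    return "\n".join(entries)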

Phone alignment: Similar to word alignment, but also including subintervals for the phones (consonants and vowels) making up the words. In SGX Director, phone boundaries may be shifted. In metadata export, phone alignment may be used for speech processing such as de-essing.

Behavior modes: SGX behavior modes are triggered by auto-detection of emotions (like positive or negative) and vocal events (like efforts), as well as through markup. They may be edited in SGX Director, or exported to drive NPC reactions or other gameplay elements. 
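
As a hedged illustration of that export path, assuming behavior modes arrive as timed intervals with a mode label (the schema and mode names below are guesses, not the actual SGX vocabulary), a simple lookup can turn them into NPC reaction events:

# Illustrative mapping from behavior modes to NPC reactions; the mode
# names and interval schema are assumptions for this sketch.
REACTIONS = {
    "positive": "npc_smile_and_nod",
    "negative": "npc_step_back",
    "effort":   "npc_flinch",
}

def npc_reactions(mode_intervals):
    """mode_intervals: assumed [{"mode", "start", "end"}, ...].
    Returns (time, reaction) events sorted by time."""
    return sorted(
        (interval["start"], REACTIONS[interval["mode"]])
        for interval in mode_intervals
        if interval["mode"] in REACTIONS
    )

At runtime, events like these could be pushed onto the game's event queue as the line plays, so bystander NPCs react in sync with the speaker's emotional shifts.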

Expressions: The automatically determined facial expressions drawn from the character's behavior modes. In SGX Director, users can get very specific in choosing which expression occurs when, or they can let the system decide. When exported, expressions can be used to drive corresponding body animations. 

Modifiers: A rich set of high-level behavior tweaks, from the magnitude and speed of movements to the average frequency of blinks.

Prosodic Metadata: A set of analyses of intonation, stress, and intensity in speech. Developers may export this data to drive other analyses. Prosodic metadata includes intensity, pitch, stress, and phrases.
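
As one example of a downstream analysis, assuming phrases export with start and end times and intensity arrives as time-stamped samples (hypothetical schemas in both cases), phrases could be ranked by mean intensity to find the most emphatic moments in a line:

def rank_phrases_by_intensity(phrases, intensity_samples):
    """Rank phrases by mean vocal intensity. phrases: assumed
    [{"start": float, "end": float}, ...]; intensity_samples: assumed
    [[time, value], ...]. Both schemas are hypothetical."""
    def mean_intensity(start, end):
        values = [v for t, v in intensity_samples if start <= t <= end]
        return sum(values) / len(values) if values else 0.0

    return sorted(
        phrases,
        key=lambda p: mean_intensity(p["start"], p["end"]),
        reverse=True,
    )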
