Developing a Natural-Sounding Text-to-Speech Technology

ReadSpeaker’s Lead Developer Pontus Melin and Business Development Director Tom Dipietropolo talked about their text-to-speech technology, shared how the tech can be used in games, and revealed the algorithms behind TTS.

About the Company

Pontus Melin: My name is Pontus Melin, and I am the Lead Developer for ReadSpeaker’s Game Engine Plugin. I studied computer science at Uppsala University in Sweden and currently work on the development of the ReadSpeaker game engine plugins for Unity and Unreal Engine, within ReadSpeaker’s Swedish development team.

Tom Dipietropolo: I am Tom Dipietropolo, Business Development Director. I studied business administration at Salem State University and now lead business development at ReadSpeaker for our accessibility-in-game-development initiatives.

We have two R&D centers, one in Seoul, Korea, and the other in Uppsala, Sweden. In each of them, we have speech scientists, linguists, and computer engineers who combine elements of human speech with the latest technology along with empirical data garnered over 20 years of developing TTS technology. We create solutions with our customers in mind and our focus has always been on providing the highest quality, most lifelike voices made with groundbreaking, best-in-class technology. 

What is TTS?

Text to Speech (TTS) – or synthetic speech technology – is software that takes any given text as input and automatically converts it into audible speech. At ReadSpeaker, we focus on accurate, natural-sounding, human-like TTS. We have a range of solutions that convert text into speech in a variety of sound qualities, with different computational and memory requirements.
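
In code terms, the core contract is simple: text goes in, audio samples come out. Here is a minimal C++ sketch of such an interface; the names and the buffer-sizing stub are illustrative, not ReadSpeaker’s actual API.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical minimal TTS interface: text in, PCM audio samples out.
// A real engine runs text normalization, phonetic transcription, and
// acoustic synthesis behind this call; this stub just sizes a buffer.
struct AudioBuffer {
    std::vector<int16_t> samples;  // 16-bit PCM
    int sampleRateHz = 22050;
};

AudioBuffer synthesize(const std::string& text) {
    AudioBuffer out;
    // Placeholder: reserve roughly 80 ms of audio per character of input.
    out.samples.resize(text.size() * out.sampleRateHz * 80 / 1000);
    return out;
}

int main() {
    AudioBuffer speech = synthesize("Hello, world.");
    std::cout << speech.samples.size() << " samples at "
              << speech.sampleRateHz << " Hz\n";
}
```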

Using the Technology in Games

Text to speech can help improve accessibility for blind and low-vision users by narrating on-screen UI, visual cues, and dynamically generated text like chat messages in multiplayer games.

There is an increasing demand for accessible games. Xbox recently published its Xbox Accessibility Guidelines (XAG), outlining the best practices in this field. TTS can be applied to several of them, including Audio Description, Screen Narration, and TTS chat. Once the game is set up for screen narration, any updates to text elements in the UI are automatically reflected in the generated speech, as sketched below. This means that TTS can be implemented early without having to worry about wasting time or resources when iterating on the UI design.
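
To illustrate how screen narration can stay in sync with a changing UI, here is a minimal C++ sketch in which a hypothetical label narrates its own text changes; a real integration would route the callback to the TTS engine instead of printing.

```cpp
#include <functional>
#include <iostream>
#include <string>

// Hypothetical UI label that notifies a narrator whenever its text changes.
// Hooking narration to the text-change event is what lets TTS pick up
// UI iterations automatically, with no re-recording step.
class NarratedLabel {
public:
    using Narrator = std::function<void(const std::string&)>;

    explicit NarratedLabel(Narrator narrator) : narrate_(std::move(narrator)) {}

    void setText(const std::string& text) {
        text_ = text;
        if (narrate_) narrate_(text_);  // speech stays in sync with the UI
    }

private:
    std::string text_;
    Narrator narrate_;
};

int main() {
    // Stand-in narrator; a real one would call the TTS engine.
    NarratedLabel health([](const std::string& t) {
        std::cout << "[TTS] " << t << '\n';
    });
    health.setText("Health: 100");
    health.setText("Health: 75");  // updated text is spoken automatically
}
```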

Digital voices are also great for prototyping voice acting in the early stages of development. By using TTS to prototype voice acting, developers can save a lot of time and money during the development phase. For instance, the technology is useful when you’re creating a cutscene that revolves around dialogue: animations, effects, and camera panning all need to be timed with the dialogue itself.

Let’s say that there’s a change in the dialogue script. In order to implement the new dialogue, the studio would need to record new voice lines for the updated dialogue before making the appropriate adjustments to the surrounding elements. Our runtime-integrated TTS enables developers to skip this process entirely. The idea is that by hooking into the game engine’s sound subsystem and having our TTS integrated into the runtime, the voiceover automatically reflects the updated dialogue script. So there’s no need to bounce back and forth between adjusting the script and creating new voice audio.
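
A minimal C++ sketch of that idea, with hypothetical names: dialogue playback looks the text up in the current script table and synthesizes it on the spot, so a script edit is all it takes to change the voiceover.

```cpp
#include <iostream>
#include <map>
#include <string>

// Hypothetical dialogue playback that synthesizes the *current* script text
// at runtime instead of loading a pre-recorded clip per line ID.
// Editing the script table is enough to change the voiceover.
std::map<std::string, std::string> gScript = {
    {"intro_01", "Welcome to the outpost, traveler."},
    {"intro_02", "Supplies are low. We leave at dawn."},
};

void speak(const std::string& text) {
    // Stand-in for the TTS engine feeding the engine's sound subsystem.
    std::cout << "[voice] " << text << '\n';
}

void playLine(const std::string& lineId) {
    speak(gScript.at(lineId));  // always reflects the latest script text
}

int main() {
    playLine("intro_01");
    gScript["intro_02"] = "Supplies are low. We leave at first light.";  // script edit
    playLine("intro_02");  // no re-recording needed; the new line is spoken
}
```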

We also provide the viseme information for the generated speech. Visemes describe the mouth and lip movements of speech, and with access to this information, developers can easily automate lip-sync for any character using TTS-generated speech.
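
Here is a simplified C++ sketch of how a timestamped viseme track, as a TTS engine might return it alongside the audio, can drive a character’s mouth shape each frame; the viseme labels and data layout are illustrative, not ReadSpeaker’s actual format.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical viseme track: each entry says which mouth shape becomes
// active at a given time in the generated audio.
struct VisemeEvent {
    double timeSec;      // when this mouth shape starts
    std::string viseme;  // e.g. "HH", "EH", "sil" (silence)
};

// Find the viseme active at time tSec; a renderer would map it to a
// blend shape or bone pose on the character's face each frame.
std::string activeViseme(const std::vector<VisemeEvent>& track, double tSec) {
    std::string current = "sil";
    for (const auto& ev : track) {
        if (ev.timeSec <= tSec) current = ev.viseme;
        else break;  // track is sorted by time
    }
    return current;
}

int main() {
    std::vector<VisemeEvent> track = {
        {0.00, "sil"}, {0.10, "HH"}, {0.22, "EH"}, {0.35, "LL"}, {0.50, "OW"},
    };
    for (double t = 0.0; t < 0.6; t += 0.15)
        std::cout << "t=" << t << "s mouth shape: "
                  << activeViseme(track, t) << '\n';
}
```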

The Algorithm Behind the Tech

There are different ways of building TTS, from the traditional method of concatenating recordings to the latest AI-driven technology. ReadSpeaker builds TTS voices using five different technologies. The following ReadSpeaker product names refer to these technologies and their characteristics:

  • USS HQ is a high-quality TTS product that uses the unit selection synthesis technique, which synthesizes speech by concatenating small speech units (a simplified sketch of this approach follows the list).
  • HMM Micro uses a statistical acoustic model called HMM (Hidden Markov Model) and a parametric vocoder. Its footprint is very small and it requires little computing power, which makes it adequate for embedded systems. However, its audio quality is limited and can sound muffled and robotic.
  • DNN Micro is an upgraded version of HMM Micro. The HMM model is replaced with DNN (Deep Neural Network). It provides better quality while maintaining most of the advantages of HMM Micro, but requires more computing power.
  • DNN HQ Micro is a higher-quality version of DNN Micro. In this technology, the parametric vocoder of DNN Micro is replaced with a neural vocoder which can generate higher quality speech. It requires much more computing power but can run fast enough without a GPU. It is usually suitable for mobile phones, PCs, and servers rather than low-spec embedded systems.
  • DNN HQ Cloud is the highest quality TTS, which makes full use of the latest DNN technology, including the end-to-end acoustic model and parallel neural vocoder. It has a large footprint and requires high-end GPUs, which makes it more suitable for cloud environments.
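
To make the unit selection idea behind USS HQ concrete, here is a heavily simplified C++ sketch: for each target sound it picks the recorded unit that minimizes a target cost (distance from the desired pitch) plus a join cost (pitch jump from the previous unit). Real systems score many more acoustic features and run a full Viterbi search over the candidate lattice; this greedy version only illustrates the trade-off.

```cpp
#include <cmath>
#include <iostream>
#include <limits>
#include <string>
#include <vector>

// Heavily simplified unit selection: each recorded unit carries a pitch.
// Target cost pulls units toward the desired pitch; join cost penalizes
// pitch jumps between consecutive units, keeping concatenation smooth.
struct Unit {
    std::string phone;  // which sound this recording covers
    double pitchHz;     // stand-in for the unit's acoustic features
};

std::vector<Unit> selectUnits(const std::vector<std::vector<Unit>>& candidates,
                              double desiredPitchHz) {
    std::vector<Unit> chosen;
    double prevPitch = desiredPitchHz;
    for (const auto& options : candidates) {
        double bestCost = std::numeric_limits<double>::max();
        Unit bestUnit = options.front();
        for (const Unit& u : options) {
            double targetCost = std::abs(u.pitchHz - desiredPitchHz);
            double joinCost = std::abs(u.pitchHz - prevPitch);
            if (targetCost + joinCost < bestCost) {
                bestCost = targetCost + joinCost;
                bestUnit = u;
            }
        }
        chosen.push_back(bestUnit);
        prevPitch = bestUnit.pitchHz;
    }
    return chosen;  // concatenating these units' waveforms yields the speech
}

int main() {
    // Two candidate recordings per target sound in "hi" (/h/ then /ay/).
    std::vector<std::vector<Unit>> candidates = {
        {{"HH", 110.0}, {"HH", 130.0}},
        {{"AY", 112.0}, {"AY", 150.0}},
    };
    for (const Unit& u : selectUnits(candidates, 115.0))
        std::cout << u.phone << " @ " << u.pitchHz << " Hz\n";
}
```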

All DNN products are really useful in the gaming field because they are able to generate high-quality speech output, support a variety of emotions and speaking styles, and allow easy prosody control.

The main challenge in creating TTS that’s suitable for gaming is achieving realistic voice quality while running fast enough, even on low-spec embedded systems. Another challenge is improving the way TTS voices generate and control a range of emotions and prosodic changes. Our team is working hard to achieve these goals through continuous optimization and the development of new algorithms.

Dealing with Accents

Our TTS technologies use models trained on recorded data, which makes it possible to reproduce an accent by recording a voice with that accent. An accurate lexicon containing the relevant vocabulary has to be built in advance. ReadSpeaker has consolidated its own expertise by creating lexicons for a variety of regional accents over the years. We continue to expand these resources.
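
A minimal C++ sketch of the lexicon idea, with illustrative entries and phonetic notation (not ReadSpeaker’s actual format): look the word up in the accent-specific lexicon first, fall back to the base lexicon, and only then to letter-to-sound rules.

```cpp
#include <iostream>
#include <map>
#include <string>

// Hypothetical accent-aware lexicon lookup. The accent-specific lexicon
// overrides the base pronunciation where the accents differ.
using Lexicon = std::map<std::string, std::string>;

std::string pronounce(const std::string& word,
                      const Lexicon& accent, const Lexicon& base) {
    if (auto it = accent.find(word); it != accent.end()) return it->second;
    if (auto it = base.find(word); it != base.end()) return it->second;
    return "<letter-to-sound rules>";  // out-of-vocabulary fallback
}

int main() {
    Lexicon base    = {{"tomato", "t ah m ey t ow"}};
    Lexicon british = {{"tomato", "t ah m aa t ow"}};
    std::cout << pronounce("tomato", british, base) << '\n';  // accent entry wins
    std::cout << pronounce("potato", british, base) << '\n';  // falls through
}
```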

The challenge is that speakers often mix different accents when they speak. In this case, problems can arise due to mismatches between actual pronunciation and lexical pronunciation. We are currently researching how to deal with these types of issues.

TTS for Nintendo Switch

ReadSpeaker made 11 languages available to Nintendo Switch developers, leveraging our DNN Micro Engine. 

Use cases include Dynamic Text to Speech, UI Narration, Audio Description, NPC Conversations, and more. In particular, the TTS voices enable characters to speak freely, with emotional range, rather than the game following a predetermined scenario.

For news on forthcoming deployments, we invite you to follow us on our website and our social media channels.

TTS in Virtual Environments

In online multiplayer games, communication is key. Having our dynamic runtime TTS integrated into a game enables everyone to communicate, even gamers who have traditionally been left out of voice communication in gaming. Since we have a large selection of voices in many different languages, a user can select a voice that resonates with them. We also allow the tweaking of parameters such as speed and pitch, so users can make their voice persona even more personalized.
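
A minimal C++ sketch of such a voice persona, with hypothetical voice names and parameters: the user’s chosen voice, speed, and pitch travel with every utterance.

```cpp
#include <iostream>
#include <string>

// Hypothetical voice persona: a chosen voice plus user-tweakable speed
// and pitch multipliers, passed to the engine with every utterance.
struct VoicePersona {
    std::string voiceId = "en-US-female-1";  // illustrative voice name
    double speed = 1.0;  // 1.0 = default speaking rate
    double pitch = 1.0;  // 1.0 = the voice's natural pitch
};

void speakAs(const VoicePersona& p, const std::string& text) {
    // A real engine call would apply these parameters during synthesis.
    std::cout << p.voiceId << " (speed " << p.speed << ", pitch " << p.pitch
              << "): " << text << '\n';
}

int main() {
    VoicePersona mine{"en-GB-male-2", 1.1, 0.95};  // user's customized persona
    speakAs(mine, "Ready when you are.");
}
```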

And then, there is dynamically generated voice acting. This is a feature that is set to become increasingly relevant as TTS technology progresses and voices sound even more realistic. Plenty of games rely on some sort of dynamically generated content. A common example is generating a crowd of NPCs roaming the streets. Given the amount of manual work it would take to create every single character that you see, dynamic generation is a common method of providing the illusion of a real crowd. By using TTS, developers can give dynamically generated characters voices and have each of them sound unique by utilizing voice customization, for instance, by tweaking parameters like pitch.
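
As a minimal C++ sketch of that idea (with hypothetical voice names), each generated NPC can draw small random pitch and speed offsets so a handful of base voices sounds like a crowd of distinct individuals.

```cpp
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Hypothetical crowd setup: give each generated NPC a slightly different
// pitch and speed so the same few TTS voices sound like many individuals.
struct NpcVoice {
    std::string voiceId;
    double pitch;
    double speed;
};

int main() {
    std::vector<std::string> voices = {"voice-a", "voice-b", "voice-c"};
    std::mt19937 rng(42);  // fixed seed: NPC voices stay stable across runs
    std::uniform_real_distribution<double> jitter(0.9, 1.1);

    for (int npc = 0; npc < 5; ++npc) {
        NpcVoice v{voices[npc % voices.size()], jitter(rng), jitter(rng)};
        std::cout << "NPC " << npc << ": " << v.voiceId
                  << " pitch " << v.pitch << " speed " << v.speed << '\n';
    }
}
```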

Limitations

Performance is a limiting factor at present. Utilizing the really high-quality voices (such as those you can hear on our website) requires a lot of GPU power. This means that we can’t realistically embed our highest quality voices in games today, as it would be too resource-intensive for those cases where fast response times are crucial. It would also contend for the same GPU resources used to render the game, which could cause issues for graphics-intensive games.

Pontus Melin and Tom Dipietropolo, Lead Developer & Business Development Director at ReadSpeaker

Interview conducted by Arti Sergeev
