ReadSpeaker's Evolution and Future Innovations: A Deep Dive Interview

The ReadSpeaker team discussed speechEngine's integration capabilities, cross-platform deployment strategies, and collaboration with industry giants Xbox and Sony.

How has ReadSpeaker evolved since we last talked? What have been some of the biggest achievements? Please tell us about the new features you’ve added, especially those that aim to enhance gaming experiences and meet the needs of game developers and gamers.

Pontus: Since we last checked in, we’ve been focusing on making our text-to-speech tools the best they can be for game developers and players alike. This year, we’ve been working on bringing our latest iteration of AI-trained, high-quality, realistic TTS voices to our SDK while also expanding the possibilities for integration by taking another look at the industry-standard toolset and seeing where we fit in. This has led us to forge a partnership with Audiokinetic, who we are now working with to bring a deep integration of ReadSpeaker’s speechEngine to Wwise.

Furthermore, we’ve improved our TTS voices: we released our next-gen Neural voices in the speechEngine SDK, which offer performance on par with the previous generation but with audio quality and realism that far surpass it. This means developers get more realistic-sounding voices with higher audio quality and improved resource efficiency at runtime.

By expanding our partnerships, we have managed to create a text-to-speech tool that offers cross-platform capability and runs natively on-device at runtime. We hear a lot of enthusiastic feedback from game developers because this fits their needs so well: speechEngine is very lightweight and runs on a single thread on the CPU while still bringing high-quality, lifelike TTS to games.

Our R&D team is now putting a lot of effort into optimizing the new generation of TTS voices, which will further improve performance, lowering the footprint and CPU usage even further while simultaneously improving voice quality.

On another note, we’re also aware of the importance of providing players with a seamless experience across consoles when it comes to accessibility. We’re continuing our work to deliver a completely cross-platform, integrate-once-and-deploy-anywhere tool for developers to bring TTS to any major gaming platform or console. 

Since our last check-in, we have added support for PS4®, PS5®, Xbox Series X™, and Nintendo Switch™.

I’m also really excited to share a sneak preview of our soon-to-launch plugin for Wwise with you. This exciting news will be announced soon on the Audiokinetic and ReadSpeaker websites.

Can you share more about speechEngine’s integration capabilities with industry tools?

Pontus: When designing our tools, we wanted to give developers a simple, user-friendly interface for voice customization, as well as for integrating speechEngine into existing text-based systems to easily enhance them with speech. With that mindset, we developed deep integrations between speechEngine and Unreal Engine and Unity.

Offering plug-in solutions for these platforms means that developers can easily install and start using speechEngine directly from the familiar interfaces of Unity and Unreal. You can convert text into speech using only Blueprint nodes in Unreal Engine, or by writing a couple of lines of C# in Unity.

While we keep the interface simple, we still offer all the detailed controls you need to manage the entire process from text to speech; we just wanted the more advanced controls to be optional. In essence, only three steps are required to produce speech using the simple interface we provide: 1) configure a ‘speaker’ component in Unity or Unreal Engine, 2) initialize the plugin, and 3) send text to be converted to speech, as sketched below.
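To make that concrete, here is a minimal Unity C# sketch of those three steps. The ReadSpeakerSpeaker component and the SpeechEngine.Initialize and Speak calls are illustrative assumptions, not the actual names of the speechEngine plugin API.

```csharp
using UnityEngine;

public class MenuNarration : MonoBehaviour
{
    // 1) A 'speaker' component configured in the Inspector (voice, rate, etc.).
    //    ReadSpeakerSpeaker is a hypothetical name for the plugin's speaker component.
    [SerializeField] private ReadSpeakerSpeaker speaker;

    private void Start()
    {
        // 2) Initialize the plugin once before requesting speech (hypothetical entry point).
        SpeechEngine.Initialize();

        // 3) Send text to be converted to speech; audio is synthesised on-device and played back.
        speaker.Speak("New game");
    }
}
```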

Our cross-platform support ensures that once you’ve integrated speechEngine, you can build and deploy to any target platform.

So far, we’ve had a lot of positive feedback on the usage side, which has led us to put a large focus on the backend this year, working on optimization and porting to make sure our tool is available to as many developers, and ultimately players, as possible. We’ve also been hard at work expanding the range of integration possibilities with the Wwise integration that’s set to launch soon. More on that later!

Supporting cross-platform deployment: How is the process structured, and which platforms do you support?

Pontus: Right now we support all of the major gaming platforms, including consoles, Android/iOS, Windows, and Linux. The first order of business was to port speechEngine to every platform. Here we also need to take hardware limitations into account, and we have technology that allows us to scale down the output quality in favour of reduced resource requirements. This is something we actively use on mobile devices to make sure performance can meet real-time requirements across the board.
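As an illustration of that trade-off, a Unity-side configuration might look something like the sketch below. The SpeechEngine.SetQuality call and the quality levels are assumptions made for the sake of the example; the real speechEngine configuration API may expose this differently.

```csharp
using UnityEngine;

public class TtsQualityConfig : MonoBehaviour
{
    private void Awake()
    {
        // On mobile hardware, trade some output quality for lower CPU and memory use
        // so synthesis keeps up with real-time requirements (hypothetical API).
        if (Application.isMobilePlatform)
            SpeechEngine.SetQuality(SpeechQuality.Reduced);
        else
            SpeechEngine.SetQuality(SpeechQuality.High);
    }
}
```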

Once the porting work was done, we were able to bring it into our integrations for Unity and Unreal Engine: by utilizing the respective build systems, we embed and deploy speechEngine as part of the packaged game at build time.

As you are now official members of the Xbox GDK program and the Sony Middleware program, please tell us how you ensure compatibility across these platforms, how you collaborate with these teams, and what benefits and performance updates you bring to game developers.

Pontus: I think we share a common goal of helping developers make games more accessible. This common goal allows us to share technical roadmaps and get our solution into the hands of developers.

Ensuring compatibility with Xbox® and PlayStation® consoles means that we have to make sure speechEngine is fully functional on each platform and that it operates smoothly with Unity/Unreal when deploying to these platforms. For every platform, we consider which platform-specific features we can take advantage of to ensure the best possible performance and functionality; we’re always looking for ways to make our product the best it can be on each platform.

We’ve seen ReadSpeaker mention "Ethical AI voice" technology. Can you explain what this means?

A: We believe in ethical AI voice technology for the gaming industry, and directly address the concerns raised by voice actors regarding the use of AI in voice acting. Unlike some AI voice cloning technologies that might exploit existing voice data without consent, we emphasize direct collaboration with voice actors. This means actors are actively involved in the process, providing their voices and giving explicit permission for how their voices are used and potentially modified. We apply AI in our TTS development to enhance the creative process and offer new possibilities in game development, while simultaneously upholding the rights and interests of voice actors. 

Please share with our team some examples of how speechEngine can be used in different in-game scenarios.

Pontus: We mainly envision this as an accessibility-enhancing tool. When speaking of accessibility in games, TTS commonly comes up. No matter what game you’re playing, it’s bound to have text displayed on screen, and for players who have difficulty taking in that text, TTS is an invaluable tool. Unfortunately, adding TTS to games has not been an easy process in the past. Some developers we’ve spoken with who have used TTS for accessibility have been creating the speech as audio files with TTS software and then slotting in each individual file to be associated with a particular UI element. This process works, but it’s cumbersome and inflexible.

Our vision was this: instead of manually associating each UI element with a particular audio file, what if you could associate the UI element with a “Speaker”? This Speaker would take the text displayed in that particular UI element, and as soon as you triggered the event that would previously have played an audio file, it would instead generate the speech audio on the fly and play it as soon as it was produced. This makes it much easier for developers to integrate TTS in their games, and it enables TTS early in the production pipeline instead of as an afterthought.

The added benefits are that it generally reduces footprint, as our voice engines are typically around 10-20 MB, and allows for far more customization of the voice output. These benefits also carry over to prototyping use cases, where you’re rapidly iterating on character dialogue and want to nail the timings.
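A rough Unity C# sketch of that “Speaker on a UI element” idea is shown below. The Unity event-system and UI types are real; the ReadSpeakerSpeaker component and its Speak method are hypothetical stand-ins for the plugin’s speaker component.

```csharp
using UnityEngine;
using UnityEngine.EventSystems;
using UnityEngine.UI;

// When this UI element is selected (e.g. focused with a gamepad), its current text
// is synthesised and spoken on the fly instead of playing a pre-rendered audio file.
[RequireComponent(typeof(Text))]
public class SpokenUIElement : MonoBehaviour, ISelectHandler
{
    [SerializeField] private ReadSpeakerSpeaker speaker;   // hypothetical speaker component

    public void OnSelect(BaseEventData eventData)
    {
        // Whatever the element currently displays is what gets spoken, so localisation
        // and dynamic values never fall out of sync with the audio.
        speaker.Speak(GetComponent<Text>().text);
    }
}
```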

Please share with us the upcoming/planned improvements and updates.

Pontus: In short, we’re working hard on optimizing our next-generation voice engine, which we think will showcase what we’re able to do in terms of realism and quality while retaining outstanding performance. I also teased a sneak-peek of the soon-to-come integration for Wwise earlier, so I figured we could jump into that now.

Wwise is an amazing tool for adding those extra touches to audio that really make a game stand out. When we looked at what Wwise does, we saw a distinct use case for having our text-to-speech tool source the speech data to then be further processed by Wwise. In the beginning, we saw two possibilities. The first was producing the speech as audio files inside the authoring tool; the second, more interesting one was building a deep integration that allows the voice model to be used as an audio source at runtime. The second was more in line with the principles we have worked with previously, so it seemed like the most natural choice, even if it meant expanding the scope of the integration quite a bit.

Essentially, speechEngine will be selectable as a source when configuring work units in Wwise. When exporting a work unit to a SoundBank, the entire voice model will be embedded into the SoundBank, which can then be loaded at runtime. Furthermore, we take advantage of Wwise SDK features that allow piping arbitrary data into a stream readable by the plugin. Using this, we have managed to create a very flexible tool that can take text strings from the game runtime and use the generated SoundBanks to create speech based on whatever text was sent to it.
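From the game side, the flow described above might look roughly like the following Unity C# sketch. AkBankManager.LoadBank and AkSoundEngine.PostEvent come from Audiokinetic’s existing Unity integration; the ReadSpeakerWwise.SetText call that feeds the runtime text into the plugin’s data stream is a hypothetical placeholder, since the plugin has not shipped yet.

```csharp
using UnityEngine;

public class WwiseTtsLine : MonoBehaviour
{
    private void Start()
    {
        // The exported SoundBank carries the embedded voice model.
        AkBankManager.LoadBank("TTS_Voices", false, false);
    }

    public void SpeakLine(string text)
    {
        // Hypothetical: hand the runtime text string to the speechEngine source plugin.
        ReadSpeakerWwise.SetText(gameObject, text);

        // A normal Wwise event then plays the generated speech through the usual mixing pipeline.
        AkSoundEngine.PostEvent("Play_TTS_Line", gameObject);
    }
}
```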

Over the last few months, we have been working closely with the team at Audiokinetic to ensure that this works seamlessly and across platforms. I’m happy to say that we’re now at a point where we’re looking towards a near-term release. I don’t want to make any promises regarding dates at this point, but we’re looking at a launch in the first half of 2025.
