It can even preserve the speaker's emotions.
Microsoft has unveiled a fascinating neural model called VALL-E, which can synthesize speech from a text prompt and a three-second audio sample. While text-to-speech (TTS) synthesis is nothing new, the fact that the AI needs so little source audio to create a believable recording of a voice is astounding.
According to the paper, the researchers trained the neural codec language model on 60K hours of English speech, which resulted in VALL-E "significantly outperforming the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity."
The model can even preserve the speaker's emotion and the acoustic environment of the original audio prompt.
You can listen to what the system is capable of here.