
Microsoft's Neural Model VALL-E Can Turn Text to Speech Using Three-Second Audio Prompt

It can even maintain a range of emotions.

Microsoft has unveiled a fascinating neural model called VALL-E, which can synthesize speech from a text prompt and a three-second audio sample. While text-to-speech (TTS) synthesis is nothing new, the fact that the AI needs such a short sample to produce a believable recording is astounding.

According to the paper, the researchers trained the neural codec language model on 60K hours of English speech, which resulted in VALL-E "significantly outperforming the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity."

The model can even preserve the speaker's emotion and the acoustic environment of the original audio prompt.
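To make the idea a little more concrete, here is a minimal, purely illustrative sketch of the zero-shot pipeline the paper describes: a neural audio codec turns the three-second prompt into discrete acoustic tokens, a language model continues those tokens conditioned on the target text, and the codec decoder turns the result back into audio. Every function name and number below is a hypothetical placeholder, not Microsoft's code or API.

# Illustrative sketch of the zero-shot TTS idea behind VALL-E.
# All components here are stand-ins; the real model uses a trained
# neural codec and a trained codec language model.
import numpy as np

def encode_prompt_to_codec_tokens(prompt_audio: np.ndarray) -> np.ndarray:
    """Stand-in for a neural audio codec encoder that turns a ~3-second
    waveform into discrete acoustic tokens."""
    # Placeholder: map a slice of the waveform to fake token IDs in [0, 1023].
    return (np.clip(np.abs(prompt_audio[:240]), 0.0, 1.0) * 1023).astype(np.int64)

def language_model_continue(text: str, prompt_tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the codec language model: conditioned on the target text
    and the speaker's prompt tokens, it predicts new acoustic tokens that
    keep the speaker's voice, emotion, and acoustic environment."""
    rng = np.random.default_rng(0)
    n_new = 75 * len(text.split())  # rough, made-up token budget per word
    return rng.integers(0, 1024, size=n_new)

def decode_tokens_to_waveform(tokens: np.ndarray, sample_rate: int = 24_000) -> np.ndarray:
    """Stand-in for the codec decoder that turns acoustic tokens back into audio."""
    return np.zeros(len(tokens) * sample_rate // 75, dtype=np.float32)

# Usage: a three-second prompt plus target text yields new speech "in that voice".
three_second_prompt = np.random.default_rng(1).standard_normal(3 * 24_000)
acoustic_prompt = encode_prompt_to_codec_tokens(three_second_prompt)
new_tokens = language_model_continue("Hello from a cloned voice.", acoustic_prompt)
waveform = decode_tokens_to_waveform(new_tokens)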

You can listen to what the system is capable of here. Also, don't forget to join our 80 Level Talent platform, our Reddit page, and our Telegram channel, and follow us on Instagram and Twitter, where we share breakdowns, the latest news, awesome artworks, and more.
