More terrifying news from the AI world: a deep neural network can now reconstruct a facial image from an audio recording of a person speaking.
In a paper titled “Speech2Face: Learning the Face Behind a Voice”, a team of researchers examines an approach for inferring a person’s facial attributes from audio recordings. “How much can we infer about a person’s looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking”, the abstract states.
The team behind the paper designed and trained a deep neural network that performs this task using millions of natural Internet/YouTube videos of people speaking. During training, their model learned voice-face correlations that let it produce images that “capture various physical attributes of the speakers such as age, gender and ethnicity”. This is done in a self-supervised manner, exploiting the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly.
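To make that self-supervised setup concrete, here is a minimal sketch in PyTorch of the core training idea: a frozen, pretrained face encoder turns each video’s own face frame into a target feature vector, and a voice encoder is trained to predict that vector from the speech spectrogram, so no facial attribute is ever hand-labeled. This is an illustrative sketch, not the authors’ released code; the layer sizes, the stand-in face encoder, and the MSE loss are all assumptions.

```python
# Minimal sketch of self-supervised voice-to-face training.
# Assumptions (not from the paper): layer sizes, the stand-in
# face encoder, and the MSE loss. The core trick is the paper's,
# though: a frozen face network supplies the training targets.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a speech spectrogram to a face-feature vector."""
    def __init__(self, feat_dim: int = 4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq_bins, time_frames)
        h = self.conv(spectrogram).flatten(1)
        return self.fc(h)

# Hypothetical stand-in for a pretrained face-recognition network;
# it is frozen, so only the voice encoder learns.
face_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 4096))
face_encoder.requires_grad_(False)

voice_encoder = VoiceEncoder()
optimizer = torch.optim.Adam(voice_encoder.parameters(), lr=1e-4)

def training_step(spectrogram, face_image):
    """One step: the only supervision is the video's own face frame."""
    with torch.no_grad():
        target = face_encoder(face_image)        # face features (frozen net)
    pred = voice_encoder(spectrogram)            # features predicted from voice
    loss = nn.functional.mse_loss(pred, target)  # pull voice features toward face features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the face network is frozen, the voice encoder is forced to learn whatever face information the audio actually carries; a separate decoder (omitted here) can then turn the predicted features back into an image.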
The paper studies the idea in depth and evaluates how closely the reconstructions resemble the true face images of the speakers. You can learn more about the work and find the full paper here.
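One simple way such an evaluation can be framed, sketched below under the assumption that a face-recognition embedding is available: compare the reconstruction and the true face in feature space and report their cosine similarity. The paper uses several measures of this general kind; the function here only illustrates the basic idea.

```python
# Hedged sketch of scoring a reconstruction against the true face:
# embed both images with a face-recognition network and compare the
# embeddings. face_encoder is any pretrained embedding model (an
# assumption here, not the paper's specific evaluation pipeline).
import torch
import torch.nn.functional as F

def face_similarity(face_encoder, reconstructed, true_face) -> float:
    """Cosine similarity between face embeddings; 1.0 means identical features."""
    with torch.no_grad():
        a = face_encoder(reconstructed)  # embedding of the voice-based reconstruction
        b = face_encoder(true_face)      # embedding of the real face image
    return F.cosine_similarity(a, b, dim=1).mean().item()
```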