In a paper titled “Speech2Face: Learning the Face Behind a Voice”, a team of researchers examines an approach for inferring a person’s facial attributes from audio recordings of their speech. “How much can we infer about a person’s looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking,” states the description.
The team behind the paper designed and trained a deep neural network that performs this task using millions of natural Internet/YouTube videos of people speaking. During training, their model learned voice-face correlations that let it produce images that “capture various physical attributes of the speakers such as age, gender and ethnicity”. This is said to be done in a self-supervised manner, exploiting the natural co-occurrence of faces and speech in Internet videos, without the need to model the attributes explicitly.
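To make the self-supervised idea concrete, here is a minimal PyTorch-style sketch of one way such a setup could look: a voice encoder is trained so that its embedding of an audio clip matches the embedding a frozen, pretrained face encoder produces for a video frame of the same speaker. The module names, dimensions, and loss are illustrative assumptions, not the paper’s actual architecture.

```python
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Hypothetical encoder mapping a speech spectrogram to a face-embedding-sized vector."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq, time)
        features = self.conv(spectrogram).flatten(1)
        return self.fc(features)

def training_step(voice_encoder, face_encoder, spectrogram, face_frame, optimizer):
    """One self-supervised step: no human labels, only the co-occurring face frame."""
    with torch.no_grad():                        # the pretrained face encoder stays frozen
        target = face_encoder(face_frame)        # face embedding acts as the supervision signal
    pred = voice_encoder(spectrogram)
    loss = nn.functional.mse_loss(pred, target)  # pull the voice embedding toward the face embedding
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a pipeline like this, the predicted embedding would be handed to a separate face decoder to render an actual image. Note that nowhere does the training loop see explicit labels for age, gender, or ethnicity; those correlations emerge from the face embeddings themselves, which is what makes the approach self-supervised.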