“There is an extensive literature on estimating 3D face shape, facial expressions, and facial motion from images and videos. Less attention has been paid to estimating 3D properties of faces from sound,” the team wrote in their paper. “Understanding the correlation between speech and facial motion thus provides additional valuable information for analyzing humans, particularly if visual data are noisy, missing, or ambiguous.”
The team used a dataset of 12 subjects and 480 sequences, each about 3 to 4 seconds long, to train their deep neural network model, VOCA (Voice Operated Character Animation), on NVIDIA Tesla GPUs using the cuDNN-accelerated TensorFlow deep learning framework.
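To make the setup concrete, here is a minimal sketch of how such an audio-to-face regression could be framed in TensorFlow. It is an illustrative assumption, not the authors' actual VOCA architecture: the window size, feature dimension, vertex count, and layer choices below are placeholders standing in for whatever the paper specifies.

```python
# Hypothetical sketch of an audio-to-face-mesh regressor -- NOT the
# authors' actual VOCA model. Assumes fixed-size windows of speech
# features as input and 3D offsets for a fixed mesh topology as output.
import numpy as np
import tensorflow as tf

WINDOW_FRAMES = 16    # assumed audio frames per input window
FEATURE_DIM = 29      # assumed per-frame speech feature dimension
NUM_VERTICES = 5023   # assumed vertex count of the face-mesh template

def build_model():
    """Convolutional encoder over the audio window, dense decoder
    producing per-vertex 3D displacements."""
    audio = tf.keras.Input(shape=(WINDOW_FRAMES, FEATURE_DIM), name="audio_window")
    x = tf.keras.layers.Conv1D(64, 3, activation="relu", padding="same")(audio)
    x = tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu", padding="same")(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    offsets = tf.keras.layers.Dense(NUM_VERTICES * 3)(x)
    offsets = tf.keras.layers.Reshape((NUM_VERTICES, 3), name="vertex_offsets")(offsets)
    return tf.keras.Model(audio, offsets)

model = build_model()
model.compile(optimizer="adam", loss="mse")  # L2 loss on vertex positions

# Toy stand-ins for the real (audio window, mesh offset) training pairs.
x_dummy = np.random.randn(8, WINDOW_FRAMES, FEATURE_DIM).astype("float32")
y_dummy = np.random.randn(8, NUM_VERTICES, 3).astype("float32")
model.fit(x_dummy, y_dummy, epochs=1, batch_size=4)
```

In practice the predicted offsets would be added to a neutral face template and supervised against captured mesh sequences; the dummy arrays here only show the expected tensor shapes.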