DiffPoseTalk pairs a diffusion model with a style encoder to produce speech-driven, stylistic 3D facial animations.
A team of researchers from Tsinghua University, Beijing Jiaotong University, and Tianjin University has recently shared a research paper introducing DiffPoseTalk, a new AI-based speech-driven stylistic 3D facial animation and head pose generation framework. Built on a diffusion model combined with a style encoder that extracts style embeddings from short reference videos, the model can generate realistic-looking 3D facial animations from speech, a shape template, and reference styles.
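To give a rough feel for the style-encoder idea described above, here is a minimal sketch of an encoder that pools a short sequence of per-frame motion parameters (e.g. reconstructed 3DMM coefficients from a reference clip) into a single style embedding. All layer choices, dimensions, and names are illustrative assumptions, not taken from the authors' code:

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Toy stand-in for a style encoder: compresses a short reference motion
    sequence into one fixed-size style embedding vector."""
    def __init__(self, feat_dim=56, style_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, style_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=style_dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, motion_seq):       # motion_seq: (batch, frames, feat_dim)
        h = self.encoder(self.proj(motion_seq))
        return h.mean(dim=1)             # (batch, style_dim) style embedding

encoder = StyleEncoder()
reference = torch.randn(2, 75, 56)       # ~3 s of reference motion at 25 fps (dummy data)
style_embedding = encoder(reference)     # (2, 128)
```

At generation time, an embedding like this would condition the diffusion model alongside the speech features and the shape template.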
"During inference, we employ classifier-free guidance to guide the generation process based on the speech and style," commented the team. "We extend this to include the generation of head poses, thereby enhancing user perception."
"Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Our extensive experiments and user study demonstrate that our approach outperforms state-of-the-art methods."
Learn more about DiffPoseTalk here and don't forget to join our 80 Level Talent platform and our Telegram channel, follow us on Instagram, Twitter, and LinkedIn, where we share breakdowns, the latest news, awesome artworks, and more.