Image courtesy of Google and its new VLOGGER video generation model.
A couple of days ago, the Google Research team officially introduced a new AI-powered generative model capable of creating controllable videos of variable length from a single photo of a person. Dubbed VLOGGER, the model requires no per-person training and does not rely on face detection and cropping: it generates the complete image of a target person talking and moving, including head motion and gestures.
"Our framework is a two-stage pipeline based on stochastic diffusion models to model the one-to-many mapping from speech to video," the team wrote in the research paper. "The first network takes as input an audio waveform to generate intermediate body motion controls, which are responsible for gaze, facial expressions, and pose over the target video length. The second network is a temporal image-to-image translation model that extends large image diffusion models, taking the predicted body controls to generate the corresponding frames. To condition the process to a particular identity, the network also takes a reference image of a person."
Moreover, the team demonstrated that the model can edit existing videos, using its diffusion model to change the subject's expression. Lastly, VLOGGER can handle video translation, modifying the lip and face regions of an existing video so they match new audio content in a different language.
Learn more about VLOGGER here and don't forget to join our 80 Level Talent platform and our Telegram channel, follow us on Instagram, Twitter, and LinkedIn, where we share breakdowns, the latest news, awesome artworks, and more.