The model achieves high-quality zero-shot video synthesis with superior photorealism and temporal consistency.
A team of researchers from NVIDIA, University of Chicago, and University of Maryland has unveiled PYoCo, a large-scale text-to-video diffusion model built upon the foundations of eDiff-I, a cutting-edge image generation model, with the addition of a novel video noise prior.
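The article does not spell out what the video noise prior looks like, so the following is only a rough sketch of the general idea: instead of drawing independent noise for every frame, the frames share part of their noise and are therefore correlated. The function names, the `alpha` mixing parameter, and the variance split are illustrative assumptions, not the team's implementation.

```python
import math
import random


def mixed_video_noise(num_frames, num_pixels, alpha, seed=None):
    """Return one noise vector per frame, all sharing a common component.

    Hypothetical sketch: each frame's noise is a shared Gaussian draw plus an
    independent Gaussian draw. The variances alpha^2 / (1 + alpha^2) and
    1 / (1 + alpha^2) sum to 1, so every entry keeps unit variance.
    """
    rng = random.Random(seed)
    shared_std = math.sqrt(alpha**2 / (1.0 + alpha**2))
    indep_std = math.sqrt(1.0 / (1.0 + alpha**2))
    # One shared draw per pixel, reused by every frame.
    shared = [rng.gauss(0.0, shared_std) for _ in range(num_pixels)]
    # Each frame adds its own independent draw on top of the shared one.
    return [
        [s + rng.gauss(0.0, indep_std) for s in shared]
        for _ in range(num_frames)
    ]


def correlation(xs, ys):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    frames = mixed_video_noise(num_frames=4, num_pixels=20000, alpha=2.0, seed=0)
    # Under these assumptions the expected correlation between any two
    # frames is alpha^2 / (1 + alpha^2).
    print(correlation(frames[0], frames[1]))
```

With `alpha = 0` the shared component vanishes and the frames receive independent noise, as in a standard image diffusion setup; larger values of `alpha` make the per-frame noise more alike.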
According to the developers, the model combines several effective techniques from prior studies, such as temporal attention, joint image-video fine-tuning, a cascaded generation architecture, and an ensemble of expert denoisers, and it surpasses other methods on several benchmark datasets. The paper shared by the team also highlights the model's high-quality zero-shot video synthesis, with superior photorealism and temporal consistency.
"We propose a video diffusion noise prior tailored for fine-tuning text-to-image diffusion models for text-to-video synthesis," comments the team. "We show that fine-tuning a text-to-image diffusion model with this prior leads to better knowledge transfer and efficient training. On the small-scale unconditional generation benchmark, we achieve a new state-of-the-art with a 10× smaller model and 14× less training time. On the zero-shot MSR-VTT evaluation, our model achieves a new state-of-the-art FID of 9.73."