A New Stable Diffusion-Based Text-to-Video Model by NVIDIA

The model is capable of generating videos consisting of 113 frames and rendering them at 24 FPS.

The NVIDIA Research team has introduced a new Stable Diffusion-based model for high-quality video synthesis, which enables users to generate short videos from text prompts. Built on Latent Diffusion Models (LDMs), the model is trained in a compressed, lower-dimensional latent space, avoiding excessive compute demands, and can produce 113-frame videos at a 1280x2048 resolution rendered at 24 FPS, resulting in clips roughly 4.7 seconds long.
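To get a feel for why working in a compressed latent space keeps compute manageable, here is a back-of-the-envelope sketch. It assumes a Stable-Diffusion-style VAE with an 8x spatial downsampling factor and 4 latent channels; both numbers are assumptions carried over from the public Stable Diffusion release, not figures from NVIDIA's paper.

```python
# Rough arithmetic only -- not NVIDIA's code.
frames        = 113          # clip length reported in the article
fps           = 24
height, width = 1280, 2048   # target output resolution
down, z_ch    = 8, 4         # assumed VAE downsampling factor and latent channels

pixel_values  = frames * 3 * height * width                   # RGB pixel tensor size
latent_values = frames * z_ch * (height // down) * (width // down)

print(f"clip length  : {frames / fps:.1f} s")                  # ~4.7 s
print(f"pixel tensor : {pixel_values:,} values")
print(f"latent tensor: {latent_values:,} values")
print(f"compression  : {pixel_values / latent_values:.0f}x fewer values to denoise")
```

Under these assumptions, the diffusion model denoises roughly 48x fewer values per clip than it would in pixel space.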

"We apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task," commented the team. "We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models."

"Our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation."

You can read the full research paper and check out more results shared by NVIDIA.

Published 19 April 2023
Theodore McKenzie
Head of Content