A New Stable Diffusion-Based Text-to-Video Model by NVIDIA

The model can generate videos of 113 frames and render them at 24 FPS.

The NVIDIA Research team has introduced a new Stable Diffusion-based model for high-quality video synthesis that lets users generate short videos from text prompts. Built on Latent Diffusion Models (LDMs), the model is trained in a compressed, lower-dimensional latent space, which avoids excessive compute demands. It can create 113-frame videos at a resolution of 1280x2048 and render them at 24 FPS, resulting in clips about 4.7 seconds long.
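The figures above can be checked with a little arithmetic. The Python sketch below is only an illustration: the `ClipSpec` class and its method names are hypothetical, the reading of 1280x2048 as height by width is an assumption, and the downsampling factor of 8 per spatial dimension is borrowed from Stable Diffusion's image autoencoder, since the article does not state the factor used for this video model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical container for the clip figures quoted in the article."""
    frames: int
    fps: int
    height: int  # assumed: the article's "1280x2048" is height x width
    width: int

    def duration_seconds(self) -> float:
        # Clip length is simply frame count divided by playback rate.
        return self.frames / self.fps

    def pixels_per_frame(self) -> int:
        return self.height * self.width

    def latent_positions_per_frame(self, downsample: int) -> int:
        # Spatial positions left after shrinking each side by `downsample`.
        return (self.height // downsample) * (self.width // downsample)


clip = ClipSpec(frames=113, fps=24, height=1280, width=2048)

# Assumption: factor 8 per side, as in Stable Diffusion's image autoencoder.
# The article does not give the factor for this model.
ASSUMED_DOWNSAMPLE = 8

duration = clip.duration_seconds()
pixels = clip.pixels_per_frame()
latents = clip.latent_positions_per_frame(ASSUMED_DOWNSAMPLE)

print(f"duration: {duration:.1f} s")
print(f"pixel positions per frame: {pixels}")
print(f"latent positions per frame (assumed factor {ASSUMED_DOWNSAMPLE}): {latents}")
print(f"spatial reduction: {pixels // latents}x")
```

Under that assumed factor, each frame would have 64 times fewer spatial positions in latent space than in pixel space, which gives a rough sense of why working in a compressed latent space keeps compute demands down.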

"We apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task," commented the team. "We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models."

"Our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation."

