The AI creates 4-second clips of 32 frames.
Take a look at CogVideo – a model that generates short videos from text input. It's trained by inheriting a pretrained text-to-image model, CogView2.
As with DALL-E, you type what you want to see, and the model creates an output – but this time in the form of a 4-second, 32-frame video. It's trained on 5.4 million text-video pairs and produces videos of pretty good quality.
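To make the text-in, frames-out idea concrete, here is a minimal sketch. It's illustrative only: `generate_video` is a hypothetical stand-in for the real model (it returns noise of the right shape so the snippet runs end to end), and the 480×480 resolution is an assumption; the 32-frame count matches the 4-second clip described above.

```python
import numpy as np
from PIL import Image

# Hypothetical stand-in for the real model: CogVideo maps a text prompt to
# a short sequence of RGB frames; this placeholder just returns noise of
# the right shape so the sketch runs end to end.
def generate_video(prompt: str, num_frames: int = 32, size: int = 480) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.integers(0, 256, (num_frames, size, size, 3), dtype=np.uint8)

# One prompt in, one 4-second clip (32 frames) out.
frames = generate_video("a lion is drinking water")
for i, frame in enumerate(frames):
    Image.fromarray(frame).save(f"frame_{i:02d}.png")
```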
The researchers also proposed a multi-frame-rate hierarchical training strategy to better align text and video clips. The model takes Chinese text as its original input, but prompts translated from other languages should work well too.
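The hierarchy itself is easy to picture: generate a coarse sequence of keyframes first, then repeatedly fill in a frame between every pair of neighbors until the clip reaches the target frame rate. The sketch below is a toy version of that recursion, not the paper's code: `interpolate` simply averages two frames, where the real model would condition a transformer on the text and the surrounding frames.

```python
import numpy as np

def interpolate(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    # Placeholder for the interpolation stage; averaging keeps the sketch
    # runnable, whereas the real model generates the in-between frame.
    return left // 2 + right // 2

def hierarchical_generate(keyframes: list, levels: int = 2) -> list:
    # Each level inserts one frame between every pair of neighbors,
    # roughly doubling the frame rate -- the recursive-interpolation idea
    # behind the multi-frame-rate hierarchy, in toy form.
    frames = list(keyframes)
    for _ in range(levels):
        denser = []
        for left, right in zip(frames, frames[1:]):
            denser += [left, interpolate(left, right)]
        denser.append(frames[-1])
        frames = denser
    return frames

# 9 keyframes -> 17 -> 33 frames after two doubling levels.
keys = [np.zeros((480, 480, 3), dtype=np.uint8) for _ in range(9)]
print(len(hierarchical_generate(keys)))  # 33
```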
Check out the research on GitHub. Also, don't forget to join our new Reddit page and our new Telegram channel, and follow us on Instagram and Twitter, where we share breakdowns, the latest news, awesome artworks, and more.