Swap an object in a video for one described in a text prompt.
Researchers from The Chinese University of Hong Kong, SmartMore, and Adobe presented Video-P2P, a framework for real-world video editing with cross-attention control. Simply put, it can replace an object in the video with the one you specify in a text prompt.
Video-P2P adapts a pretrained text-to-image diffusion model to a range of video editing tasks. The creators propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion, and then optimize a shared unconditional embedding to achieve accurate video inversion.
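The "shared embedding" idea can be illustrated with a toy numerical sketch: one unconditional embedding is optimized jointly over all frames so the model's noise predictions match the noise recorded during inversion. Everything here is invented for illustration (the linear "denoiser", the shapes, the variable names); the real method optimizes the embedding through a diffusion U-Net.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, latent_dim, embed_dim = 4, 8, 8

# Frozen stand-in "denoiser": a linear map from (latent, embedding) to noise.
W_lat = rng.normal(size=(latent_dim, latent_dim))
W_emb = rng.normal(size=(latent_dim, embed_dim))

latents = rng.normal(size=(n_frames, latent_dim))       # inverted frame latents
target_noise = rng.normal(size=(n_frames, latent_dim))  # noise recorded at inversion

def predict(emb):
    # A single embedding is broadcast over every frame -- the "shared" part.
    return latents @ W_lat.T + emb @ W_emb.T

emb = np.zeros(embed_dim)   # start from the plain empty-prompt embedding
lr = 1e-3
initial_loss = np.mean((predict(emb) - target_noise) ** 2)

for _ in range(500):
    resid = predict(emb) - target_noise        # (n_frames, latent_dim)
    grad = 2 * np.mean(resid @ W_emb, axis=0)  # gradient w.r.t. the shared embedding
    emb -= lr * grad

final_loss = np.mean((predict(emb) - target_noise) ** 2)
```

After optimization, `final_loss` is lower than `initial_loss`: the shared embedding improves reconstruction of every frame at once, which is what makes the later edit faithful to the source video.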
"For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability, while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing."
These designs enable text-driven editing applications, including word swap, prompt refinement, and attention re-weighting. Video-P2P appears to work on real-world videos, generating new characters while preserving the original poses and scenes, though broader testing is needed to verify the researchers' claims.
Meanwhile, you can read the paper and wait for the code here. Also, don't forget to join our Reddit page and our Telegram channel, follow us on Instagram and Twitter, where we share breakdowns, the latest news, awesome artworks, and more.