The tool can be used to animate both realistic and cartoony characters.
A team of researchers from Alibaba Group's Institute for Intelligent Computing has introduced Animate Anyone, a new diffusion-based method for consistent and controllable image-to-video synthesis for character animation.
According to the team's research paper, the framework utilizes ReferenceNet to merge detailed features through spatial attention, preserving the consistency of intricate appearance features from the reference image. To ensure controllability and continuity, the team introduced an efficient pose guider to direct character movements and employed an effective temporal modeling approach for smooth transitions between video frames. The team also extended the training data so the solution can handle both realistic and cartoonish humanoid characters.
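The reference-feature merging the paper describes can be illustrated with a minimal NumPy sketch: queries come from the denoising features, while keys and values are the denoising features concatenated with same-resolution reference features, so every spatial position can attend to reference-image detail. The function names and shapes below are hypothetical simplifications, not the paper's actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_ref_attention(x, ref):
    """Sketch of spatial attention with reference features (hypothetical).
    x:   (tokens, dim) features in the denoising branch
    ref: (tokens, dim) ReferenceNet features at the same resolution
    Queries come from x only; keys/values are [x; ref] concatenated
    along the token axis, so x can "look up" reference detail."""
    kv = np.concatenate([x, ref], axis=0)          # (2 * tokens, dim)
    scores = x @ kv.T / np.sqrt(x.shape[-1])       # (tokens, 2 * tokens)
    return softmax(scores) @ kv                    # (tokens, dim), same shape as x
```

The output keeps the shape of the denoising features, so the block can be dropped into a UNet layer without changing the surrounding tensor flow.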
"The pose sequence is initially encoded using Pose Guider and fused with multi-frame noise, followed by the Denoising UNet conducting the denoising process for video generation," the team explained. "The computational block of the Denoising UNet consists of Spatial-Attention, Cross-Attention, and Temporal-Attention."
"The integration of reference image involves two aspects. Firstly, detailed features are extracted through ReferenceNet and utilized for Spatial-Attention. Secondly, semantic features are extracted through the CLIP image encoder for Cross-Attention. Temporal-Attention operates in the temporal dimension. Finally, the VAE decoder decodes the result into a video clip."
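The ordering the quote describes (spatial attention with ReferenceNet features, cross-attention on CLIP image features, then temporal attention across frames) can be sketched as follows. This is a toy NumPy illustration under assumed shapes; `denoising_block` and its arguments are hypothetical names, not the paper's code.

```python
import numpy as np

def attention(q, kv):
    # scaled dot-product attention: q (n, dim), kv (m, dim) -> (n, dim)
    s = q @ kv.T / np.sqrt(q.shape[-1])
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    return (s / s.sum(axis=-1, keepdims=True)) @ kv

def denoising_block(frames, ref_feats, clip_feats):
    """Hypothetical sketch of one Denoising UNet computational block.
    frames:     (n_frames, tokens, dim) latent features per video frame
    ref_feats:  (tokens, dim) ReferenceNet detail features
    clip_feats: (n_ctx, dim) CLIP image-encoder semantic features"""
    out = []
    for f in frames:
        # Spatial-Attention: keys/values include the reference features
        f = attention(f, np.concatenate([f, ref_feats], axis=0))
        # Cross-Attention: attend to CLIP semantic features
        f = attention(f, clip_feats)
        out.append(f)
    x = np.stack(out)                              # (n_frames, tokens, dim)
    # Temporal-Attention: each spatial token attends across frames
    x = np.stack([attention(x[:, t], x[:, t]) for t in range(x.shape[1])],
                 axis=1)
    return x
```

In the real system this block would repeat at multiple UNet resolutions, and the final latents would be decoded into a video clip by the VAE decoder, as the team describes.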
Here are some more examples shared by the team: