Learn the details of NVIDIA’s GAN-based approach for converting semantic inputs into photorealistic videos.
NVIDIA has shared a paper presenting a new vid2vid method that generates photorealistic videos from semantic inputs while remembering what it has already rendered. The approach reuses previously generated data about the 3D world and applies it when synthesizing new parts of the scene. To carry this world-structure information, the team introduces guidance images: physically grounded estimates of how an object should look, derived from earlier outputs.
As their name suggests, the role of these guidance images is to guide the generative model toward colors and textures that respect previously generated frames.
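To make the idea of a guidance image concrete, here is a minimal sketch of one way such an image could be formed: reprojecting a previously generated frame into the current viewpoint using known depth and camera pose, and leaving pixels with no correspondence empty. This is an illustrative simplification, not the paper's actual pipeline; the function name, the depth/pose inputs, and the nearest-neighbor sampling are all assumptions made for this example.

```python
import numpy as np

def guidance_image(prev_frame, depth_cur, K, T_prev_from_cur):
    """Illustrative sketch: reproject a previously generated frame into the current view.

    prev_frame      : (H, W, 3) previously generated RGB frame
    depth_cur       : (H, W) depth of the current view (assumed known, e.g. from SfM)
    K               : (3, 3) camera intrinsics (assumed shared by both views)
    T_prev_from_cur : (4, 4) rigid transform from current-camera to previous-camera coords
    Returns a guidance image plus a validity mask; pixels with no match stay black.
    """
    H, W = depth_cur.shape
    # Pixel grid of the current view in homogeneous coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project current pixels to 3D points in the current camera frame.
    pts_cur = np.linalg.inv(K) @ pix * depth_cur.reshape(1, -1)

    # Move the points into the previous camera frame and project them.
    pts_prev = T_prev_from_cur[:3, :3] @ pts_cur + T_prev_from_cur[:3, 3:4]
    proj = K @ pts_prev
    z = proj[2]
    x = proj[0] / np.maximum(z, 1e-6)
    y = proj[1] / np.maximum(z, 1e-6)

    # Keep only points that land inside the previous frame and in front of the camera.
    valid = (z > 0) & (x >= 0) & (x < W - 1) & (y >= 0) & (y < H - 1)

    guidance = np.zeros((H * W, 3), dtype=prev_frame.dtype)
    xi = np.clip(np.round(x).astype(int), 0, W - 1)
    yi = np.clip(np.round(y).astype(int), 0, H - 1)
    guidance[valid] = prev_frame[yi[valid], xi[valid]]

    return guidance.reshape(H, W, 3), valid.reshape(H, W)
```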
World-Consistent Video-to-Video Synthesis
In summary, the study introduces a new architecture built around a multi-SPADE module that combines semantic maps, optical-flow warping, and guidance images to produce photorealistic, temporally smooth videos.
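The sketch below illustrates the general idea of a multi-SPADE block: a feature map is modulated in sequence by several spatially varying conditioning signals (the semantic map, the flow-warped previous output, and the guidance image). This is a minimal sketch assuming standard SPADE-style layers; the exact normalization choice, layer ordering, and channel sizes in NVIDIA's model may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Spatially-adaptive normalization: per-pixel scale/shift predicted
    from a conditioning map (e.g. a semantic map or an RGB image)."""
    def __init__(self, feat_channels, cond_channels, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, cond):
        # Resize the conditioning map to the feature resolution, then modulate.
        cond = F.interpolate(cond, size=feat.shape[-2:], mode="nearest")
        h = self.shared(cond)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)

class MultiSPADEBlock(nn.Module):
    """Sketch of a multi-SPADE residual block: features are modulated in turn
    by the semantic map, the flow-warped previous frame, and the guidance image."""
    def __init__(self, channels, sem_channels, img_channels=3):
        super().__init__()
        self.spade_sem = SPADE(channels, sem_channels)
        self.spade_warp = SPADE(channels, img_channels)
        self.spade_guide = SPADE(channels, img_channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, semantic_map, warped_prev, guidance):
        h = self.spade_sem(feat, semantic_map)
        h = self.spade_warp(h, warped_prev)
        h = self.spade_guide(h, guidance)
        return feat + self.conv(F.leaky_relu(h, 0.2))
```

Chaining the three conditionings lets each signal contribute what it knows best: the semantic map fixes object layout, the warped previous frame enforces short-term temporal consistency, and the guidance image enforces consistency with everything generated earlier in the 3D world.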
Check the full paper on GitHub.