
NVIDIA's New Model for Multimodal Conditional Image Synthesis

The new model allows mixing different inputs like text, segmentation, sketch, or style reference.

NVIDIA presented a new synthesis model that lets users combine multiple input modalities to generate new images. The team noted that existing conditional image synthesis frameworks generate images based on user input in a single modality, such as text, segmentation, sketch, or style reference. The problem is that these models usually accept only one type of input, so you can use text or a sketch but not both at once. To address this limitation, the NVIDIA team developed the Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework, which can generate images conditioned on any desired subset of these modalities.

"PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator. Through our carefully designed training scheme, PoE-GAN learns to synthesize images with high quality and diversity," wrote the team. Besides advancing the state of the art in multimodal conditional image synthesis, PoE-GAN also outperforms the best existing unimodal conditional image synthesis approaches when tested in the unimodal setting."

A few weeks earlier, NVIDIA also presented a similar AI tool called GauGAN 2, whose main feature is the ability to turn a simple written phrase or sentence into a photorealistic image using deep learning.

You can learn more about PoE-GAN here (the code is coming soon).

