Version 2.0 incorporates textual and visual knowledge into the diffusion model, improving the quality of images.
In case you missed it
The Chinese text-to-image diffusion model ERNIE-ViLG has been upgraded to version 2.0, which improves image quality and makes it the largest text-to-image model at present. Developer Baidu improved ERNIE-ViLG by incorporating textual and visual knowledge of key elements in the scene and by using different denoising experts at different denoising stages.
According to the creators, ERNIE-ViLG 2.0 achieves state-of-the-art results on Microsoft COCO (a widely used dataset for object detection and image captioning) with a zero-shot FID score of 6.75; lower FID scores indicate generated images that are statistically closer to real ones. It also reportedly outperforms recent models in terms of image fidelity and image-text alignment.
The researchers used a text parser and an object detector to extract the key elements of the scene from each text-image pair in the training data, and guided the model to pay more attention to how those elements align during learning. They also divided the denoising steps into several stages and assigned a separate denoising “expert” to each stage. Because only one expert runs at any given step, the model can hold more parameters and learn the data distribution of each denoising stage better without increasing inference time. This design lets ERNIE-ViLG 2.0 scale up to 24 billion parameters, which is 10 times more than Stable Diffusion, making it the largest text-to-image model at the time of its release.
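
The stage-based routing described above can be illustrated with a minimal Python sketch. It is not Baidu's implementation: the function names `expert_index` and `denoise`, the toy experts, and the even split of timesteps into stages are all assumptions made for illustration. The point it shows is that each step calls exactly one expert, so adding experts adds parameters but not per-step compute.

```python
def expert_index(timestep: int, num_timesteps: int, num_experts: int) -> int:
    """Map a denoising timestep to the expert that owns its stage.

    Assumption: timesteps are split into contiguous, roughly equal stages.
    """
    if not 0 <= timestep < num_timesteps:
        raise ValueError("timestep out of range")
    return timestep * num_experts // num_timesteps


def denoise(x, num_timesteps, experts):
    """Run the reverse process, calling exactly one expert per step."""
    for t in reversed(range(num_timesteps)):
        expert = experts[expert_index(t, num_timesteps, len(experts))]
        x = expert(x, t)
    return x


# Toy usage: each hypothetical "expert" only records that it was called.
calls = []

def make_expert(i):
    def expert(x, t):
        calls.append(i)
        return x  # a real expert would return a less noisy image
    return expert

experts = [make_expert(i) for i in range(3)]
result = denoise("noise", num_timesteps=6, experts=experts)
print(calls)  # [2, 2, 1, 1, 0, 0]
```

With six steps and three experts, the noisiest steps go to the last expert and the final steps to the first, and the total number of network calls equals the number of steps regardless of how many experts exist.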