A Comprehensive Pipeline for Complex Text-to-Image Synthesis,Journal of Computer Science and Technology


A Comprehensive Pipeline for Complex Text-to-Image Synthesis
Journal of Computer Science and Technology (IF 1.2), Pub Date: 2020-05-01, DOI: 10.1007/s11390-020-0305-9
Fei Fang, Fei Luo, Hong-Pan Zhang, Hua-Jian Zhou, Alix L. H. Chow, Chun-Xia Xiao

Synthesizing a complex scene image with multiple objects and a background from a text description is a challenging problem that requires solving several difficult tasks spanning natural language processing and computer vision. We model it as a combination of semantic entity recognition, object retrieval and recombination, and optimization of the objects' status. To achieve a satisfactory result, we propose a comprehensive pipeline that converts the input text into its visual counterpart. The pipeline consists of text processing, foreground object and background scene retrieval, image synthesis using constrained Markov Chain Monte Carlo (MCMC), and post-processing. First, we roughly divide the objects parsed from the input text into foreground objects and background scenes. Second, we retrieve the required foreground objects from a foreground object dataset segmented from the Microsoft COCO dataset, and retrieve an appropriate background scene image from a background image dataset collected from the Internet. Third, to ensure that the positions and sizes of the foreground objects are reasonable in the image synthesis step, we design a cost function and use the MCMC method as the optimizer to solve this constrained layout problem. Finally, to make the image look natural and harmonious, we use Poisson-based and relighting-based methods to blend the foreground objects with the background scene image in the post-processing step. The synthesized results and comparisons on the Microsoft COCO dataset show that our method outperforms some state-of-the-art methods based on generative adversarial networks (GANs) in the visual quality of the generated scene images.
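
The abstract does not give the cost function or the MCMC proposal scheme, so the following is only a minimal sketch of how a layout step of this kind might look, not the authors' implementation. Every name (`layout_cost`, `optimize_layout`, `target_heights`, `ground_y`) and every cost term (canvas bounds, a ground line, a preferred height, pairwise overlap) is an assumption made for illustration. Boxes are `(x, y, width, height)` tuples, and a Metropolis-style acceptance rule is used purely as a stochastic optimizer.

```python
import math
import random


def overlap_area(a, b):
    """Area of intersection of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = min(ax + aw, bx + bw) - max(ax, bx)
    dy = min(ay + ah, by + bh) - max(ay, by)
    return max(0.0, dx) * max(0.0, dy)


def layout_cost(boxes, target_heights, canvas, ground_y):
    """Hypothetical cost: lower means a more plausible layout."""
    canvas_w, canvas_h = canvas
    cost = 0.0
    for i, (x, y, w, h) in enumerate(boxes):
        # Penalize any part of the box that leaves the canvas.
        cost += max(0.0, -x) + max(0.0, -y)
        cost += max(0.0, x + w - canvas_w) + max(0.0, y + h - canvas_h)
        # Assumed constraint: objects should stand on a ground line.
        cost += abs(y + h - ground_y)
        # Assumed size prior: stay close to a preferred height.
        cost += abs(h - target_heights[i])
        # Penalize overlap with every later box (each pair counted once).
        for other in boxes[i + 1:]:
            cost += overlap_area((x, y, w, h), other) / 100.0
    return cost


def optimize_layout(boxes, target_heights, canvas, ground_y,
                    steps=5000, temperature=5.0, seed=0):
    """Metropolis-style search over positions and sizes of the boxes."""
    rng = random.Random(seed)
    current = [tuple(b) for b in boxes]
    current_cost = layout_cost(current, target_heights, canvas, ground_y)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Perturb one randomly chosen box: shift it and rescale it uniformly.
        i = rng.randrange(len(current))
        x, y, w, h = current[i]
        scale = math.exp(rng.gauss(0.0, 0.05))
        proposal = list(current)
        proposal[i] = (x + rng.gauss(0.0, 10.0), y + rng.gauss(0.0, 10.0),
                       w * scale, h * scale)
        new_cost = layout_cost(proposal, target_heights, canvas, ground_y)
        # Always accept improvements; accept worse layouts with
        # probability exp(-(increase in cost) / temperature).
        accept = new_cost <= current_cost or \
            rng.random() < math.exp((current_cost - new_cost) / temperature)
        if accept:
            current, current_cost = proposal, new_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


# Usage: two overlapping objects placed near the top-left corner.
initial = [(0.0, 0.0, 80.0, 160.0), (10.0, 10.0, 120.0, 60.0)]
targets = [150.0, 70.0]
best, best_cost = optimize_layout(initial, targets,
                                  canvas=(640, 480), ground_y=400.0)
```

Occasionally accepting a worse layout lets the search escape local minima, which is the usual motivation for preferring MCMC over greedy descent on a non-convex layout cost. The best layout seen so far is tracked separately, so the returned cost is never higher than that of the initial layout.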


Updated: 2020-05-01
