Dense-CaptionNet: a Sentence Generation Architecture for Fine-grained Description of Image Semantics
Cognitive Computation (IF 5.4), Pub Date: 2020-03-02, DOI: 10.1007/s12559-019-09697-1
I. Khurram, M. M. Fraz, M. Shahzad, N. M. Rajpoot

Automatic image captioning, a highly challenging research problem, aims to understand and describe the contents of a complex scene in human-understandable natural language. Most recent solutions are based on holistic approaches in which the scene is described as a whole, potentially losing the important semantic relationships between objects in the scene. We propose Dense-CaptionNet, a region-based deep architecture for fine-grained description of image semantics, which localizes and describes each object/region in the image separately and generates a more detailed description of the scene. The proposed network contains three components which work together to generate a fine-grained description of image semantics. Region descriptions and object relationships are generated by the first module, whereas the second generates the attributes of objects present in the scene. The textual descriptions produced by these two modules are concatenated and fed as input to the sentence generation module, which uses an encoder-decoder formulation to generate a grammatically correct, single-line, fine-grained description of the whole scene. Dense-CaptionNet is trained and tested on the Visual Genome, MSCOCO, and IAPR TC-12 datasets. The results establish a new state of the art compared with existing top-performing methods, e.g., Up-Down-Captioner; Show, Attend and Tell; SemStyle; and NeuralTalk, especially on complex scenes. The implementation has been shared on GitHub for other researchers: http://bit.ly/2VIhfrf
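To make the described data flow concrete, below is a minimal Python/PyTorch sketch of the three-module pipeline from the abstract: the first two modules are reduced to stubs that return description strings, their outputs are concatenated, and a toy GRU encoder-decoder stands in for the sentence generation module. Everything here (module names, stub captions, vocabulary, dimensions) is an illustrative assumption, not the authors' implementation.

    import torch
    import torch.nn as nn

    def region_module(image) -> str:
        # Stub for module 1: region descriptions and object relationships.
        return "a man rides a brown horse on the beach"

    def attribute_module(image) -> str:
        # Stub for module 2: attributes of the objects in the scene.
        return "man : young smiling ; horse : brown ; beach : sandy"

    class SentenceGenerator(nn.Module):
        # Toy GRU encoder-decoder standing in for the sentence generation module.
        def __init__(self, vocab_size: int, hidden: int = 128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, src_ids, tgt_ids):
            _, h = self.encoder(self.embed(src_ids))       # encode concatenated text
            dec, _ = self.decoder(self.embed(tgt_ids), h)  # decode, seeded with encoder state
            return self.out(dec)                           # (batch, steps, vocab) logits

    # Concatenate the two textual outputs, as the abstract describes, and feed
    # the result to the encoder-decoder; the word-index vocabulary is built on
    # the fly purely for this demo.
    text = region_module(None) + " ; " + attribute_module(None)
    vocab = {w: i for i, w in enumerate(sorted(set(text.split())))}
    src = torch.tensor([[vocab[w] for w in text.split()]])
    tgt = src[:, :5]  # dummy decoder input for a single untrained forward pass
    logits = SentenceGenerator(vocab_size=len(vocab))(src, tgt)
    print(logits.shape)  # torch.Size([1, 5, len(vocab)])

In the paper's formulation, the encoder-decoder would be trained on pairs of concatenated descriptions and reference captions so that the single output sentence fuses region, relationship, and attribute information; the untrained forward pass above only demonstrates the shape of the data flow.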



Updated: 2020-04-20