What we see in a photograph: content selection for image captioning, The Visual Computer


What we see in a photograph: content selection for image captioning
The Visual Computer (IF 3.5), Pub Date: 2020-07-10, DOI: 10.1007/s00371-020-01867-9
Georgios Barlas, Christos Veinidis, Avi Arampatzis

We propose and experimentally investigate the usefulness of several features for selecting image content (objects) suitable for image captioning. The approach taken explores three broad categories of features, namely geometric, conceptual, and visual. Experiments suggest that widely known geometric ‘rules’ in art–aesthetics or photography (such as the golden ratio or the rule-of-thirds) and facts about the human visual system (such as its wider horizontal angle than its vertical) provide no useful information for the task. Human captioners seem to prefer large, elongated (but not in the golden ratio) objects, positioned near the image center, irrespective of orientation. Conceptually, the preferred objects are either too specific or too general, and animate things are almost always mentioned; furthermore, some evidence is found for selecting diverse objects in order to achieve maximal image coverage in captions. Visual object features such as saliency, depth, edges, entropy, and contrast, are all found to provide useful information. Beyond evaluating features in isolation, we investigate how well these are combined by performing feature and feature-category ablation studies, leading to an effective set of features which can be proven useful for operational systems. Moreover, we propose alternative ways for feature engineering and evaluation, dealing with the drawbacks of the evaluation methodology proposed in past literature.
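The geometric cues named in the abstract (object size, elongation, distance from the image center, the rule-of-thirds, the golden ratio) can be made concrete with a small sketch. The code below is an illustration only, not the authors' feature definitions: the function name `geometric_features`, the bounding-box representation, and the exact formulas are assumptions chosen for clarity.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


def geometric_features(box, image_size):
    """Illustrative geometric features for one object bounding box.

    box        -- (x_min, y_min, x_max, y_max) in pixels (assumed format)
    image_size -- (width, height) in pixels
    """
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    box_w, box_h = x_max - x_min, y_max - y_min
    if box_w <= 0 or box_h <= 0 or img_w <= 0 or img_h <= 0:
        raise ValueError("box and image must have positive width and height")

    # Box center in normalized [0, 1] image coordinates.
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h

    # Size: fraction of the image area covered by the box.
    relative_area = (box_w * box_h) / (img_w * img_h)

    # Elongation: long side over short side, so orientation does not matter.
    elongation = max(box_w, box_h) / min(box_w, box_h)

    # Distance of the box center from the image center.
    center_distance = math.hypot(cx - 0.5, cy - 0.5)

    # Distance to the nearest rule-of-thirds intersection.
    thirds_points = [(x, y) for x in (1 / 3, 2 / 3) for y in (1 / 3, 2 / 3)]
    thirds_distance = min(math.hypot(cx - px, cy - py) for px, py in thirds_points)

    return {
        "relative_area": relative_area,
        "elongation": elongation,
        "center_distance": center_distance,
        "thirds_distance": thirds_distance,
        "golden_ratio_gap": abs(elongation - GOLDEN_RATIO),
    }
```

For example, `geometric_features((100, 100, 300, 200), (400, 300))` describes a centered box twice as wide as it is tall. Features of this kind could be fed to a classifier that predicts whether an object is mentioned in a caption; how the paper actually combines them is not specified in the abstract.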


Updated: 2020-07-10
