Inferring spatial relations from textual descriptions of images

https://doi.org/10.1016/j.patcog.2021.107847

Highlights

  • A novel dataset (REC-COCO) for spatial inference from text, where textual tokens are linked to bounding boxes in images.

  • Experiments showing that the contextual information of textual captions helps infer the spatial relation between objects.

  • We present an experimental analysis of various scenarios to infer spatial relations from text.

  • Qualitative results that suggest that neural network architectures can learn prototypical spatial relations between objects.

Abstract

Generating an image from its textual description requires both a certain level of language understanding and common sense knowledge about the spatial relations of the physical entities being described. In this work, we focus on inferring the spatial relation between entities, a key step in the process of composing scenes based on text. More specifically, given a caption containing a mention of a subject and the location and size of the bounding box of that subject, our goal is to predict the location and size of an object mentioned in the caption. Previous work did not use the caption text, but rather a manually provided relation holding between the subject and the object. In fact, the evaluation datasets used contain manually annotated ontological triplets but no captions, making the exercise unrealistic: a manual annotation step was required, and systems could not leverage the richer information in captions. Here we present a system that uses the full caption, and Relations in Captions (REC-COCO), a dataset derived from MS-COCO which allows spatial relation inference from captions to be evaluated directly. Our experiments show that: (1) it is possible to infer the size and location of an object with respect to a given subject directly from the caption; (2) using the full text allows the object to be placed better than using a manually annotated relation. Our work paves the way for systems that, given a caption, decide which entities need to be depicted and their respective locations and sizes, in order to then generate the final image.

Introduction

The ability to automatically generate images from textual descriptions is a fundamental skill that can boost many relevant applications, such as art generation and computer-aided design. From a scientific point of view, it also drives research progress in multimodal learning and inference across vision and language, which is currently a very active research area [1]. In the case of scenes comprising several entities, it is necessary to infer an adequate scene layout, i.e., which entities to show and their locations and sizes.

From the language understanding perspective, in order to generate realistic images from textual descriptions, it is necessary to infer visual features and relations between the entities mentioned in the text. For example, given the text “a black cat on a table”, an automatic system has to understand that the cat has a certain colour (black) and is situated on top of the table, among other details. In this paper, we focus on the spatial relations between the entities, since they are key to suitably composing the scenes described in texts. The spatial information is sometimes given explicitly, in the form of prepositions (“cat on a table”), but more often implicitly, since the verb used to relate two entities carries information about the spatial arrangement of both. For example, from the text “a woman riding a horse” it is obvious to humans that the woman is on top of the horse. However, acquiring such spatial relations from text is far from trivial, as this kind of common sense spatial knowledge is rarely stated explicitly in natural language text [2]. That is precisely what text-to-image systems learn: relating both explicit and implicit spatial relations expressed in text with the actual visual arrangements shown in images.

A large strand of research in text-to-image generation is evaluated according to the pixel-based quality of the generated images and the global fidelity to the textual descriptions, but does not evaluate whether the entities have been arranged according to the spatial relations mentioned in the text [3]. Closer to our goal, some researchers do focus on learning spatial relations between entities [4], [5], [6], [7], [8], [9]. For instance, in Gupta and Malik [6] and Krishna et al. [8] the authors propose to associate actions along with their semantic arguments (subject and object) with pixels in images (i.e., bounding boxes of entities) as a way towards understanding the images. V-COCO is a dataset which comprises images and manually created Subject, Relation, Object (S,R,O) ontological triplets, henceforth called concept triplets, where each S and O is associated with a bounding box in the image [6]. Note that the terms used to describe the triplet concepts are selected manually from a small vocabulary1 of an ontology, e.g. PERSON or BOOK, and are not linked to the words in the caption. Visual Genome is constructed similarly [8]. Typically, those datasets are created by showing images to human annotators, asking them to locate the bounding boxes of the entities participating in predefined relations, and to select the terms for the relation and entities from a reduced vocabulary in a small ontology. Using such a dataset, Collell et al. [5] present a system that uses concept triplets to infer the spatial relation between the subject S and the object O. Given the bounding box of the subject, the system outputs the location and size of the bounding box of the object. Evaluation is done by checking whether the predicted bounding box matches the actual bounding box in the image. The datasets and systems in previous work rely on manually extracted ontological triplets rather than the actual captions, which poses two issues: a manual pre-processing step is required, and systems do not use the richer information in captions.
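The exact evaluation protocol is detailed later; as background, a standard way to quantify whether a predicted bounding box matches the gold one is intersection over union (IoU). The sketch below is illustrative (the corner-format boxes and the function name are assumptions), not necessarily the metric used in [5] or in this work.

```python
# Minimal sketch: intersection over union (IoU) between a predicted and a
# gold bounding box, both given in (x1, y1, x2, y2) corner format.
# Illustrative only; the exact evaluation metric may differ.
def iou(pred, gold):
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gold
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two areas minus the intersection.
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    return inter / union if union > 0 else 0.0
```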

In this paper we propose to study the use of full captions instead of manually selected relations when inferring the spatial relation between two entities, where one of them is considered the subject and the other the object of the action being described by the relation. The problem we address is depicted in Fig. 1. Given a textual description of an image and the location (bounding box) of the subject of the action in the description, we want the system to predict the bounding box of the target object. Note that we do not use the actual pixels for this task, but we include Fig. 2 for illustrative purposes. To the best of our knowledge, there is no previous work addressing this problem, i.e., no one has studied whether using full captions instead of concept triplets benefits spatial relation inference.

Our hypothesis is that the textual description2 accompanying the image contains information that helps infer the spatial relation between two entities. We argue that the information presented in manually created triplets alone is often insufficient to properly infer spatial relations. As a motivation, Fig. 2 shows pairs of examples (left and right) where the relation between the subject and the object (given by a verb) is not enough to correctly predict the spatial relation between them. In each row there are two examples for the same subject, relation and object (e.g. person, reading, book), but the spatial relation between subject and object is different, and depends on the interpretation of the rest of the caption. For instance, in the top-left caption the person is sitting while reading a book, so the book is around the middle of the bounding box for the person, while in the top-right caption the person is lying in bed, and therefore the book is slightly above the person.

To validate the main hypothesis of our work, we created a new dataset called Relations in Captions (REC-COCO) that contains associations between caption tokens and bounding boxes in images. REC-COCO is based on the MS-COCO [10] and V-COCO [6] datasets. For each image in V-COCO, we collect its corresponding captions from MS-COCO and automatically align the concept triplet in V-COCO to the tokens in the caption. This requires finding the token for concepts such as PERSON. As a result, REC-COCO contains the captions and the tokens which correspond to each subject and object, as well as the bounding boxes for the subject and object (cf. Fig. 3).
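To make the alignment step concrete, the following is a rough sketch of mapping a V-COCO concept label to a caption token. The synonym lists and the simple string matching are assumptions for illustration only, not the exact procedure used to build REC-COCO.

```python
# Illustrative sketch: align a V-COCO concept (e.g. PERSON) to a token in an
# MS-COCO caption. The synonym lexicon and matching strategy are assumptions.
CONCEPT_SYNONYMS = {
    "PERSON": {"person", "man", "woman", "boy", "girl", "child"},
    "BOOK": {"book", "novel", "magazine"},
}

def align_concept(concept, caption):
    """Return the index of the first caption token matching the concept, or None."""
    tokens = caption.lower().split()
    synonyms = CONCEPT_SYNONYMS.get(concept, {concept.lower()})
    for i, tok in enumerate(tokens):
        if tok.strip(".,") in synonyms:
            return i
    return None

# Example: align_concept("PERSON", "A woman riding a horse.") returns 1 ("woman").
```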

In addition, we have adapted a well-known state-of-the-art architecture that worked on concept triplets [5] to also work with full captions, and performed experiments which show that: (1) it is possible to infer the size and location of an object with respect to a given subject directly from the caption; (2) using the full text of the caption allows the object to be placed better than using the manually extracted relation.
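As an illustration of the kind of model involved, the sketch below combines a caption encoder with the subject and object word embeddings and the subject bounding box, and regresses the four values of the object box. The encoder choice, layer sizes and module names are assumptions for illustration, not the exact architecture of [5] or of our adaptation.

```python
import torch
import torch.nn as nn

class CaptionToBoxRegressor(nn.Module):
    """Illustrative sketch (assumed sizes and encoder): encode the caption,
    concatenate it with the subject/object embeddings and the subject box,
    and regress the 4 values of the object bounding box."""

    def __init__(self, vocab_size, emb_dim=300, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Caption encoder: a single-layer LSTM is one possible choice.
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.regressor = nn.Sequential(
            nn.Linear(hid_dim + 2 * emb_dim + 4, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, 4),  # predicted [O^c_x, O^c_y, O^b_x, O^b_y]
        )

    def forward(self, caption_ids, subj_id, obj_id, subj_box):
        # caption_ids: (B, T); subj_id, obj_id: (B,); subj_box: (B, 4)
        _, (h, _) = self.encoder(self.embed(caption_ids))
        caption_vec = h[-1]                 # last hidden state, (B, hid_dim)
        subj_vec = self.embed(subj_id)      # (B, emb_dim)
        obj_vec = self.embed(obj_id)        # (B, emb_dim)
        feats = torch.cat([caption_vec, subj_vec, obj_vec, subj_box], dim=-1)
        return self.regressor(feats)        # (B, 4) object box prediction
```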

The main contributions of the work are the following:

  • We show for the first time that the textual description includes information that is complementary to the relation between a subject and an object. From another perspective, our work shows that, given a caption, a reference subject and an object in the caption, our system can assign a location and a size to the object using the information in the caption, without any manually added relation.

  • We introduce a new dataset created for this task. The dataset comprises pairs of images and captions, including, for each pair, the tokens in the caption that describe the subject and object, and the bounding boxes of subject and object. The dataset is publicly available under a free license.3

Section snippets

Related work

Understanding the spatial relations between entities and their distribution in space is essential to solve several tasks such as human-machine collaboration [11] or text-to-scene synthesis [7], [12], [13], and has attracted the attention of different research communities. In this section, we review the different approaches to inferring spatial relations among entities, the evaluation methodologies that have arisen from those communities, and available resources such as datasets.

REC-COCO dataset

The main goal of this paper is to extract spatial relations among the entities mentioned in image captions. To the best of our knowledge, there exists no dataset that contains explicit correspondences between image pixels (bounding boxes of entities) and their respective mentions in the image descriptions. We thus developed a new dataset, called Relations in Captions (REC-COCO), that contains such correspondences. REC-COCO is derived from MS-COCO  [10] and V-COCO [6]. The former is a collection
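For illustration, a REC-COCO entry can be thought of as a record like the following; the field names and values are hypothetical, see the released dataset for the actual format.

```python
# Hypothetical example of a REC-COCO record (field names are illustrative).
example_record = {
    "image_id": 123456,                           # MS-COCO image identifier
    "caption": "A woman riding a horse on the beach.",
    "subject_token": "woman",                     # token aligned with the V-COCO subject
    "object_token": "horse",                      # token aligned with the V-COCO object
    "subject_bbox": [120.0, 60.0, 95.0, 180.0],   # e.g. (x, y, width, height)
    "object_bbox": [100.0, 140.0, 210.0, 160.0],
}
```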

Model for inferring spatial relations from captions

The problem addressed in this work (cf. Fig. 3) is the following: given a caption, a subject token in the caption (S), the location and size of the bounding box for the subject, and a target object (O), the system needs to predict a sensible location and size for the bounding box of the object. More formally, we denote by $O^c=[O^c_x,O^c_y]\in\mathbb{R}^2$ the (x, y) coordinates of the center of the bounding box covering the object O, and by $O^b=[O^b_x,O^b_y]\in\mathbb{R}^2$ half of its width and height. Thus, we use $O=[O^c,O^b]\in\mathbb{R}^4$ as the
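As a concrete reading of this notation, the sketch below converts a bounding box given in the usual MS-COCO (x, y, width, height) format into the center/half-extent vector $O=[O^c,O^b]$; the input format is an assumption for illustration.

```python
# Convert a box given as (x, y, width, height) -- the usual MS-COCO format --
# into the representation used here: center O^c = (O^c_x, O^c_y) followed by
# half-extents O^b = (O^b_x, O^b_y), i.e. a 4-vector O = [O^c, O^b].
def to_center_half_extent(x, y, w, h):
    oc = (x + w / 2.0, y + h / 2.0)   # center of the box
    ob = (w / 2.0, h / 2.0)           # half of its width and height
    return [oc[0], oc[1], ob[0], ob[1]]
```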

Experiments

In this section we report the results of our experiments, organized into several sets depending on the research question addressed. In the first set we assess the validity and quality of the REC-COCO dataset, complementing the analysis presented in Section 3. In the second set, we study which encoder is the most effective for this task. In a third set, we check whether it is possible to infer the size and location of an object with respect to a given subject

Conclusions

In this paper, we show that using the full textual descriptions of images improves the ability to model the spatial relationships between entities. Previous research has focused on using structured concept triplets which include an ontological representation of the relation, but we show that the caption contains additional useful information which our system uses effectively to improve results. Our experiments are based on REC-COCO, a new dataset that we have automatically derived from MS-COCO

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Aitzol Elu has been supported by an ETORKIZUNA ERAIKIZ grant from the Provincial Council of Gipuzkoa. This research has been partially funded by the Basque Government excellence research group (IT1343-19), the Spanish MINECO (FuturAAL RTI2018-101045-B-C21, DeepReading RTI2018-096846-B-C21 MCIU/AEI/FEDER, EU), the BigKnowledge project (Ayudas Fundación BBVA a equipos de investigación científica 2018), and the NVIDIA GPU grant program.


References (39)

  • N. Tandon et al., Acquiring comparative commonsense knowledge from the web, Twenty-Eighth AAAI Conference on Artificial Intelligence (2014).
  • A. Mogadala, M. Kalimuthu, D. Klakow, Trends in integration of vision and language research: a survey of tasks, ...
  • B. Van Durme, Extracting implicit knowledge from text (2010).
  • S. Reed et al., Generative adversarial text to image synthesis.
  • H. Bagherinezhad et al., Are elephants bigger than butterflies? Reasoning about sizes of objects, Thirtieth AAAI Conference on Artificial Intelligence (2016).
  • G. Collell et al., Acquiring common sense spatial knowledge through implicit spatial templates, Thirty-Second AAAI Conference on Artificial Intelligence (2018).
  • S. Gupta, J. Malik, Visual semantic role labeling, arXiv preprint: ...
  • F. Huang et al., Sketchforme: composing sketched scenes from text descriptions for interactive applications, Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, UIST '19 (2019).
  • R. Krishna et al., Visual Genome: connecting language and vision using crowdsourced dense image annotations, Int. J. Comput. Vis. (2017).
  • M. Malinowski, M. Fritz, A pooling approach to modelling spatial relations for image retrieval and annotation, arXiv ...
  • T.-Y. Lin et al., Microsoft COCO: common objects in context, European Conference on Computer Vision (2014).
  • S. Guadarrama et al., Grounding spatial relations for human-robot interaction, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (2013).
  • T. Hinz et al., Generating multiple objects at spatially distinct locations, International Conference on Learning Representations (2019).
  • A.A. Jyothi et al., LayoutVAE: stochastic scene layout generation from a label set, The IEEE International Conference on Computer Vision (ICCV) (2019).
  • G.-J. Kruijff et al., Situated dialogue and spatial organization: what, where and why?, Int. J. Adv. Robot. Syst. (2007).
  • G. Platonov et al., Computational models for spatial prepositions, Proceedings of the First International Workshop on Spatial Language Understanding (2018).
  • I. Goodfellow et al., Generative adversarial nets, Advances in Neural Information Processing Systems (2014).
  • T. Xu et al., AttnGAN: fine-grained text to image generation with attentional generative adversarial networks, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).
  • S.E. Reed et al., Learning what and where to draw.

Aitzol Elu is a member of the IXA NLP group. He holds a B.Sc. in Computer Science from the University of the Basque Country (UPV/EHU), where he also obtained an M.Sc. degree in Language Analysis and Processing. His research interests include NLP and multimodal deep learning.

Gorka Azkune is an assistant professor at the University of the Basque Country (UPV/EHU). He has published over 20 international peer-reviewed articles in journals and international conferences. He is a member of the IXA NLP group. His research interests include machine learning and multimodal deep learning. He received a Ph.D. in Computer Science from the University of Deusto ([email protected]).

Eneko Agirre is a full professor at the University of the Basque Country (UPV/EHU). He has published over 150 international peer-reviewed articles and conference papers in NLP. He has been secretary and president of the ACL SIGLEX, a member of the editorial board of Computational Linguistics, and is currently an action editor of Transactions of the ACL. He has received three Google Research Awards and two best paper awards ([email protected]).

Oier Lopez de Lacalle is a postdoctoral researcher at the University of the Basque Country (UPV/EHU). He has published over 40 international peer-reviewed articles in journals and international conferences. He is a member of the IXA NLP group. His research interests include natural language processing and multimodal deep learning. He received a Ph.D. in Computer Science from the University of the Basque Country ([email protected]).

Ignacio Arganda-Carreras is an Ikerbasque Research Fellow at the University of the Basque Country (UPV/EHU), in San Sebastian, Spain. His research interests include computer vision and bioimage analysis. He received a Ph.D. in computer science and electrical engineering from the Universidad Autonoma de Madrid, Spain ([email protected]).

Aitor Soroa is an associate professor at the University of the Basque Country (UPV/EHU). He has published over 40 international peer-reviewed articles in journals and international conferences. He is a member of the IXA NLP group. His research interests include lexical semantics, machine learning and multimodal deep learning ([email protected]).
