Abstract
Although documents are increasingly multimodal, their automatic processing is often monomodal. In particular, natural language processing tasks are typically performed on the textual modality alone. This work extends the syntactic parsing task to use the image modality in addition to text. Specifically, we address prepositional phrase attachment, a hard, semantically driven problem for syntactic parsers. Given an image and a caption, the proposed approach resolves the syntactic attachment of prepositions in the parse tree according to both visual and lexical features. Visual features are derived from the nature and position of objects detected in the image and aligned to textual phrases in the caption. A reranker uses this information to reorder the syntactic trees produced by a shift-reduce syntactic parser. Trained on the Flickr-PP corpus, which contains multimodal gold-standard attachments, this approach improves over a text-only syntactic parser, in particular for the subset of prepositions that encode location, with gains of up to 17 points in attachment accuracy.
Notes
In order to take into account the fact that one word of the sentence, its root, has no governor, the common practice is to add a dummy word in front of the sentence and to consider that this dummy word is the governor of the linguistic root of the sentence. The identification of the root of the sentence is therefore viewed as the prediction of a dependency of a special kind.
Corresponds to the ratio of the area of the intersection to the area of the union of two surfaces A and B: \({\text {IoU}}(A, B) = \frac{|A \cap B|}{|A \cup B|}\), where \(|S|\) denotes the area of a surface S. This score measures how similar two surfaces are: a score of 1 means that the two surfaces are identical, and a score of 0 means that they do not overlap.
The Icsiboost classifier handles missing values by accounting for the class distribution given such values [51].
For readability reasons, a simplified part of speech tagset has been used in the regular expressions.
This corpus is available at https://github.com/sebastiendelecraz/pp-flickr.
We used the same hyper-parameters as the VSE++ implementation available at https://github.com/fartashf/vsepp.
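As an illustration of the IoU score defined in the notes above, the following minimal sketch computes it for two axis-aligned bounding boxes. The `Box` type, its corner-coordinate convention, and the function names are assumptions made for this example; they are not taken from the authors' code.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box given by its corner coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    """Area of a box; an inverted (empty) box has area 0."""
    width = max(0.0, box.x_max - box.x_min)
    height = max(0.0, box.y_max - box.y_min)
    return width * height


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, a value between 0 and 1."""
    # The intersection of two axis-aligned boxes is itself a box (possibly empty).
    intersection = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    inter_area = area(intersection)
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|.
    union_area = area(a) + area(b) - inter_area
    # Assumption: two empty boxes are given a score of 0 to avoid dividing by zero.
    return inter_area / union_area if union_area > 0 else 0.0
```

Identical boxes give a score of 1 and disjoint boxes give 0, which matches the interpretation in the note.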
References
Agirre E, Baldwin T, Martinez D (2008) Improving parsing and PP attachment performance with sense information. In: Proceedings of ACL-08: HLT. Association for Computational Linguistics, Columbus, Ohio, pp 317–325
Anguiano EH, Candito M (2011) Parse correction with specialized models for difficult attachment types. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 1222–1233
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: visual question answering. In: International conference on computer vision (ICCV)
Attardi G, Ciaramita M (2007) Tree revision learning for dependency parsing. In: Human language technologies 2007: the conference of the North American chapter of the Association for Computational Linguistics; Proceedings of the main conference, pp 388–395
Belinkov Y, Lei T, Barzilay R, Globerson A (2014) Exploring compositional architectures and word vector representations for prepositional phrase attachment. Trans Assoc Comput Linguist 2:561–572
Caglayan O, Bardet A, Bougares F, Barrault L, Wang K, Masana M, Herranz L, van de Weijer J (2018) LIUM-CVC submissions for WMT18 multimodal translation task. In: Proceedings of the third conference on machine translation, volume 2. Shared task papers. Association for Computational Linguistics, Belgium, Brussels, pp 603–608
Chang AX, Monroe W, Savva M, Potts C, Manning CD (2015) Text to 3D scene generation with rich lexical grounding. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian Federation of natural language processing, ACL 2015, pp 53–62
Chen Q, Zhu X, Ling ZH, Inkpen D, Wei S (2018) Neural natural language inference models enhanced with external knowledge. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), vol 1, pp 2406–2417
Chomsky N (1957) Syntactic structures. Walter de Gruyter, Berlin
Christie G, Laddha A, Agrawal A, Antol S, Goyal Y, Kochersberger K, Batra D (2016) Resolving language and vision ambiguities together: joint segmentation and prepositional attachment resolution in captioned scenes. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp 1493–1503
Cirik V, Berg-Kirkpatrick T, Morency LP (2018) Using syntax to ground referring expressions in natural images. In: Thirty-second AAAI conference on artificial intelligence
Coco MI, Keller F (2015) The interaction of visual and linguistic saliency during syntactic ambiguity resolution. Q J Exp Psychol 68(1):46–74
Coyne R, Sproat R (2001) WordsEye: an automatic text-to-scene conversion system. In: Proceedings of the 28th annual conference on computer graphics and interactive techniques, SIGGRAPH 2001, pp 487–496
Dasigi P, Ammar W, Dyer C, Hovy E (2017) Ontology-aware token embeddings for prepositional phrase attachment. In: Proceedings of the 55th annual meeting of the Association for Computational Linguistics (volume 1: long papers), vol 1, pp 2089–2098
de Kok D, Ma J, Dima C, Hinrichs E (2017) PP attachment: Where do we stand? In: Proceedings of the 15th conference of the European chapter of the Association for Computational Linguistics: volume 2, short papers, vol 2, pp 311–317
Delecraz S, Nasr A, Bechet F, Favre B (2017) Correcting prepositional phrase attachments using multimodal corpora. In: Proceedings of the 15th international conference on parsing technologies. Association for Computational Linguistics, Pisa, Italy, pp 72–77
Delecraz S, Nasr A, Bechet F, Favre B (2018) Adding syntactic annotations to Flickr30k Entities corpus for multimodal ambiguous prepositional-phrase attachment resolution. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC-2018). European Languages Resources Association (ELRA), Miyazaki, Japan
Delecraz S, Becerra-Bonache L, Nasr A, Béchet F, Favre B (2019) Visual disambiguation of prepositional phrase attachments: Multimodal machine learning for syntactic analysis correction. In: Rojas I, Joya G, Català A (eds) Advances in computational intelligence—15th international work-conference on artificial neural networks, IWANN 2019, Gran Canaria, Spain, June 12–14, 2019, Proceedings, Part I, Springer, Lecture Notes in Computer Science, vol 11506, pp 632–643
de Vries H, Strub F, Mary J, Larochelle H, Pietquin O, Courville AC (2017) Modulating early visual processing by language. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30. Curran Associates Inc, Red Hook, pp 6594–6604
Eisner JM (1996) Three new probabilistic models for dependency parsing: an exploration. In: Proceedings of the 16th conference on computational linguistics-volume 1. Association for Computational Linguistics, pp 340–345
Elliott D, Frank S, Barrault L, Bougares F, Specia L (2017) Findings of the second shared task on multimodal machine translation and multilingual image description. In: Proceedings of the second conference on machine translation, Copenhagen, Denmark
Faghri F, Fleet DJ, Kiros R, Fidler S (2017) VSE++: improved visual-semantic embeddings. CoRR. arXiv:1707.05612
Fang H, Gupta S, Iandola FN, Srivastava RK, Deng L, Dollár P, Gao J, He X, Mitchell M, Platt JC, Zitnick CL, Zweig G (2015) From captions to visual concepts and back. In: IEEE conference on computer vision and pattern recognition, CVPR 2015, pp 1473–1482
Favre B, Hakkani-Tür D, Cuendet S (2007) Icsiboost: an open-source implementation of BoosTexter. https://github.com/benob/icsiboost.git
Freund Y, Schapire R, Abe N (1999) A short introduction to boosting. J Jpn Soc Artif Intell 14(5):771–780
Girshick R (2015) Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 1440–1448
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587
Hall K, Novák V (2005) Corrective modeling for non-projective dependency parsing. In: Proceedings of the ninth international workshop on parsing technology. Association for Computational Linguistics, pp 42–52
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hudson RA (1984) Word grammar. Blackwell, Oxford
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: IEEE conference on computer vision and pattern recognition, CVPR, pp 3128–3137
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kolesnikov A, Kuznetsova A, Lampert C, Ferrari V (2019) Detecting visual relationships using box attention. In: Proceedings of the IEEE international conference on computer vision workshops
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li LJ, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European conference on computer vision. Springer, pp 740–755
Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19(2):313–330
McDonald R, Crammer K, Pereira F (2005) Online large-margin training of dependency parsers. In: Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05), pp 91–98
Mel'čuk IA, Pertsov NV (1986) Surface syntax of English: a formal model within the meaning-text framework. John Benjamins, Amsterdam
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
Mirroshandel SA, Nasr A (2016) Integrating selectional constraints and subcategorization frames in a dependency parser. Comput Linguist
Nasr A, Béchet F, Rey JF, Favre B, Le Roux J (2011) MACAON: an NLP tool suite for processing word lattices. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies: systems demonstrations. Association for Computational Linguistics, pp 86–91
Nivre J (2003) An efficient algorithm for projective dependency parsing. In: Proceedings of the eighth international conference on parsing technologies, pp 149–160
Nivre J, De Marneffe MC, Ginter F, Goldberg Y, Hajic J, Manning CD, McDonald R, Petrov S, Pyysalo S, Silveira N, et al. (2016) Universal dependencies v1: a multilingual treebank collection. In: Proceedings of the tenth international conference on language resources and evaluation (LREC 2016), pp 1659–1666
Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Peyre J, Laptev I, Schmid C, Sivic J (2017) Weakly-supervised learning of visual relations. In: IEEE international conference on computer vision, ICCV 2017, pp 5189–5198
Peyre J, Laptev I, Schmid C, Sivic J (2019) Detecting unseen visual relations using analogies. In: Proceedings of the IEEE international conference on computer vision, pp 1981–1990
Plummer BA, Wang L, Cervantes CM, Caicedo JC, Hockenmaier J, Lazebnik S (2017) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. Int J Comput Vis 123(1):74–93
Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 6517–6525
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Schapire RE, Singer Y (2000) BoosTexter: a boosting-based system for text categorization. Mach Learn 39(2–3):135–168
Schaerlaekens A (1973) The two-word sentence in child language development: a study based on evidence provided by Dutch-speaking triplets. Mouton, The Hague
Shi H, Mao J, Gimpel K, Livescu K (2019) Visually grounded neural syntax acquisition. In: Proceedings of the 57th annual meeting of the Association for Computational Linguistics, pp 1842–1861
Snow CE (1972) Mothers’ speech to children learning language. Child Dev 43(2):549–565
Spivey MJ, Tanenhaus MK, Eberhard KM, Sedivy JC (2002) Eye movements and spoken language comprehension: effects of visual context on syntactic ambiguity resolution. Cogn Psychol 45(4):447–481
Tesnière L (1959) Éléments de syntaxe structurale. Klincksieck, Paris
Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. Int J Comput Vis 104(2):154–171
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: IEEE conference on computer vision and pattern recognition, CVPR 2015, pp 3156–3164
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057
Yamada H, Matsumoto Y (2003) Statistical dependency analysis with support vector machines. In: Proceedings of the eighth international conference on parsing technologies
Yang J, Lu J, Lee S, Batra D, Parikh D (2018) Graph R-CNN for scene graph generation. In: Proceedings of the European conference on computer vision (ECCV), pp 670–685
Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist 2:67–78
Zhang J, Kalantidis Y, Rohrbach M, Paluri M, Elgammal A, Elhoseiny M (2019) Large-scale visual relationship understanding. Proc AAAI Conf Artif Intell 33:9185–9194
Additional information
The work of Leonor Becerra-Bonache has been performed during her teaching leave granted by the CNRS (French National Center for Scientific Research) in Laboratoire d’Informatique et Systèmes of Aix-Marseille University.
Cite this article
Delecraz, S., Becerra-Bonache, L., Favre, B. et al. Multimodal Machine Learning for Natural Language Processing: Disambiguating Prepositional Phrase Attachments with Images. Neural Process Lett 53, 3095–3121 (2021). https://doi.org/10.1007/s11063-020-10314-8
DOI: https://doi.org/10.1007/s11063-020-10314-8