Multimodal Machine Learning for Natural Language Processing: Disambiguating Prepositional Phrase Attachments with Images

Neural Processing Letters

Abstract

Although documents are increasingly multimodal, their automatic processing is often monomodal. In particular, natural language processing tasks are typically performed on the textual modality alone. This work extends the syntactic parsing task to use the image modality in addition to text. Specifically, we address the prepositional phrase attachment problem, a hard problem for syntactic parsers that is largely semantic in nature. Given an image and a caption, the proposed approach resolves the syntactic attachment of prepositions in the parse tree according to both visual and lexical features. Visual features are derived from the nature and position of detected objects in the image that are aligned to textual phrases in the caption. A reranker uses this information to reorder syntactic trees produced by a shift-reduce syntactic parser. Trained on the Flickr-PP corpus, which contains multimodal gold-standard attachments, this approach yields improvements over a text-only syntactic parser, in particular for the subset of prepositions that encode location, leading to an increase of up to 17 points in attachment accuracy.

Notes

  1. In order to take into account the fact that one word of the sentence, its root, has no governor, the common practice is to add a dummy word in front of the sentence and to consider that this dummy word is the governor of the linguistic root of the sentence. The identification of the root of the sentence is therefore viewed as the prediction of a dependency of a special kind.

  2. Intersection over union (IoU) is the ratio between the area of the intersection and the area of the union of two surfaces A and B: \({\text {IoU}}(A, B) = \frac{|A \cap B|}{|A \cup B|}\), where \(|\cdot |\) denotes area. This score measures how similar two surfaces are; a score of 1 means that the two surfaces are identical.

  3. The Icsiboost classifier handles missing values by accounting for the class distribution given such values [51].

  4. For readability reasons, a simplified part of speech tagset has been used in the regular expressions.

  5. This corpus is available at https://github.com/sebastiendelecraz/pp-flickr.

  6. We used the same hyper-parameters as the implementation available at https://github.com/fartashf/vsepp.
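
The IoU score defined in note 2 can be sketched for the common case of axis-aligned bounding boxes. This is a minimal illustration, not the article's implementation: the `(x_min, y_min, x_max, y_max)` box layout and the names `Box`, `area` and `iou` are assumptions introduced here.

```python
from typing import Tuple

# Assumed box layout (not specified in the article): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of an axis-aligned box; inverted or empty boxes count as 0."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # The intersection of two axis-aligned boxes is itself a box (possibly empty).
    inter_box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(inter_box)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```

For instance, `iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7: the two boxes overlap on a unit square, and their union covers 7 units of area. Identical boxes score 1 and disjoint boxes score 0.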

References

  1. Agirre E, Baldwin T, Martinez D (2008) Improving parsing and PP attachment performance with sense information. In: Proceedings of ACL-08: HLT. Association for Computational Linguistics, Columbus, Ohio, pp 317–325

  2. Anguiano EH, Candito M (2011) Parse correction with specialized models for difficult attachment types. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 1222–1233

  3. Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: visual question answering. In: International conference on computer vision (ICCV)

  4. Attardi G, Ciaramita M (2007) Tree revision learning for dependency parsing. In: Human language technologies 2007: the conference of the North American chapter of the Association for Computational Linguistics; Proceedings of the main conference, pp 388–395

  5. Belinkov Y, Lei T, Barzilay R, Globerson A (2014) Exploring compositional architectures and word vector representations for prepositional phrase attachment. Trans Assoc Comput Linguist 2:561–572

  6. Caglayan O, Bardet A, Bougares F, Barrault L, Wang K, Masana M, Herranz L, van de Weijer J (2018) LIUM-CVC submissions for WMT18 multimodal translation task. In: Proceedings of the third conference on machine translation, volume 2. Shared task papers. Association for Computational Linguistics, Brussels, Belgium, pp 603–608

  7. Chang AX, Monroe W, Savva M, Potts C, Manning CD (2015) Text to 3D scene generation with rich lexical grounding. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian Federation of Natural Language Processing, ACL 2015, pp 53–62

  8. Chen Q, Zhu X, Ling ZH, Inkpen D, Wei S (2018) Neural natural language inference models enhanced with external knowledge. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), vol 1, pp 2406–2417

  9. Chomsky N (1957) Syntactic structures. Walter de Gruyter, Berlin

  10. Christie G, Laddha A, Agrawal A, Antol S, Goyal Y, Kochersberger K, Batra D (2016) Resolving language and vision ambiguities together: joint segmentation and prepositional attachment resolution in captioned scenes. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp 1493–1503

  11. Cirik V, Berg-Kirkpatrick T, Morency LP (2018) Using syntax to ground referring expressions in natural images. In: Thirty-second AAAI conference on artificial intelligence

  12. Coco MI, Keller F (2015) The interaction of visual and linguistic saliency during syntactic ambiguity resolution. Q J Exp Psychol 68(1):46–74

  13. Coyne R, Sproat R (2001) WordsEye: an automatic text-to-scene conversion system. In: Proceedings of the 28th annual conference on computer graphics and interactive techniques, SIGGRAPH 2001, pp 487–496

  14. Dasigi P, Ammar W, Dyer C, Hovy E (2017) Ontology-aware token embeddings for prepositional phrase attachment. In: Proceedings of the 55th annual meeting of the Association for Computational Linguistics (volume 1: long papers), vol 1, pp 2089–2098

  15. de Kok D, Ma J, Dima C, Hinrichs E (2017) PP attachment: Where do we stand? In: Proceedings of the 15th conference of the European chapter of the Association for Computational Linguistics: volume 2, short papers, vol 2, pp 311–317

  16. Delecraz S, Nasr A, Bechet F, Favre B (2017) Correcting prepositional phrase attachments using multimodal corpora. In: Proceedings of the 15th international conference on parsing technologies. Association for Computational Linguistics, Pisa, Italy, pp 72–77

  17. Delecraz S, Nasr A, Bechet F, Favre B (2018) Adding syntactic annotations to Flickr30k Entities corpus for multimodal ambiguous prepositional-phrase attachment resolution. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC-2018). European Languages Resources Association (ELRA), Miyazaki, Japan

  18. Delecraz S, Becerra-Bonache L, Nasr A, Béchet F, Favre B (2019) Visual disambiguation of prepositional phrase attachments: Multimodal machine learning for syntactic analysis correction. In: Rojas I, Joya G, Català A (eds) Advances in computational intelligence—15th international work-conference on artificial neural networks, IWANN 2019, Gran Canaria, Spain, June 12–14, 2019, Proceedings, Part I, Springer, Lecture Notes in Computer Science, vol 11506, pp 632–643

  19. de Vries H, Strub F, Mary J, Larochelle H, Pietquin O, Courville AC (2017) Modulating early visual processing by language. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30. Curran Associates Inc, Red Hook, pp 6594–6604

  20. Eisner JM (1996) Three new probabilistic models for dependency parsing: an exploration. In: Proceedings of the 16th conference on computational linguistics-volume 1. Association for Computational Linguistics, pp 340–345

  21. Elliott D, Frank S, Barrault L, Bougares F, Specia L (2017) Findings of the second shared task on multimodal machine translation and multilingual image description. In: Proceedings of the second conference on machine translation, Copenhagen, Denmark

  22. Faghri F, Fleet DJ, Kiros R, Fidler S (2017) VSE++: improved visual-semantic embeddings. CoRR. arXiv:1707.05612

  23. Fang H, Gupta S, Iandola FN, Srivastava RK, Deng L, Dollár P, Gao J, He X, Mitchell M, Platt JC, Zitnick CL, Zweig G (2015) From captions to visual concepts and back. In: IEEE conference on computer vision and pattern recognition, CVPR 2015, pp 1473–1482

  24. Favre B, Hakkani-Tür D, Cuendet S (2007) Icsiboost: an open-source implementation of BoosTexter. https://github.com/benob/icsiboost.git

  25. Freund Y, Schapire R, Abe N (1999) A short introduction to boosting. J Jpn Soc Artif Intell 14(5):771–780

  26. Girshick R (2015) Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 1440–1448

  27. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587

  28. Hall K, Novák V (2005) Corrective modeling for non-projective dependency parsing. In: Proceedings of the ninth international workshop on parsing technology. Association for Computational Linguistics, pp 42–52

  29. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  30. Hudson RA (1984) Word grammar. Blackwell, Oxford

  31. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: IEEE conference on computer vision and pattern recognition, CVPR, pp 3128–3137

  32. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

  33. Kolesnikov A, Kuznetsova A, Lampert C, Ferrari V (2019) Detecting visual relationships using box attention. In: Proceedings of the IEEE international conference on computer vision workshops

  34. Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li LJ, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73

  35. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European conference on computer vision. Springer, pp 740–755

  36. Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19(2):313–330

  37. McDonald R, Crammer K, Pereira F (2005) Online large-margin training of dependency parsers. In: Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05), pp 91–98

  38. Mel'čuk IA, Pertsov NV (1986) Surface syntax of English: a formal model within the meaning-text framework. John Benjamins, Amsterdam

  39. Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41

  40. Mirroshandel SA, Nasr A (2016) Integrating selectional constraints and subcategorization frames in a dependency parser. Comput Linguist

  41. Nasr A, Béchet F, Rey JF, Favre B, Le Roux J (2011) Macaon: an NLP tool suite for processing word lattices. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies: systems demonstrations. Association for Computational Linguistics, pp 86–91

  42. Nivre J (2003) An efficient algorithm for projective dependency parsing. In: Proceedings of the eighth international conference on parsing technologies, pp 149–160

  43. Nivre J, De Marneffe MC, Ginter F, Goldberg Y, Hajic J, Manning CD, McDonald R, Petrov S, Pyysalo S, Silveira N, et al. (2016) Universal dependencies v1: a multilingual treebank collection. In: Proceedings of the tenth international conference on language resources and evaluation (LREC 2016), pp 1659–1666

  44. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

  45. Peyre J, Laptev I, Schmid C, Sivic J (2017) Weakly-supervised learning of visual relations. In: IEEE international conference on computer vision, ICCV 2017, pp 5189–5198

  46. Peyre J, Laptev I, Schmid C, Sivic J (2019) Detecting unseen visual relations using analogies. In: Proceedings of the IEEE international conference on computer vision, pp 1981–1990

  47. Plummer BA, Wang L, Cervantes CM, Caicedo JC, Hockenmaier J, Lazebnik S (2017) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. Int J Comput Vis 123(1):74–93

  48. Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 6517–6525

  49. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788

  50. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252

  51. Schapire RE, Singer Y (2000) BoosTexter: a boosting-based system for text categorization. Mach Learn 39(2–3):135–168

  52. Schaerlaekens A (1973) The two-word sentence in child language development: a study based on evidence provided by Dutch-speaking triplets. Mouton, The Hague

  53. Shi H, Mao J, Gimpel K, Livescu K (2019) Visually grounded neural syntax acquisition. In: Proceedings of the 57th annual meeting of the Association for Computational Linguistics, pp 1842–1861

  54. Snow CE (1972) Mothers’ speech to children learning language. Child Dev 43(2):549–565

  55. Spivey MJ, Tanenhaus MK, Eberhard KM, Sedivy JC (2002) Eye movements and spoken language comprehension: effects of visual context on syntactic ambiguity resolution. Cogn Psychol 45(4):447–481

  56. Tesnière L (1959) Eléments de syntaxe structurale. Klincksieck, Paris

  57. Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. Int J Comput Vis 104(2):154–171

  58. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: IEEE conference on computer vision and pattern recognition, CVPR 2015, pp 3156–3164

  59. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057

  60. Yamada H, Matsumoto Y (2003) Statistical dependency analysis with support vector machines. In: Proceedings of the eighth international conference on parsing technologies

  61. Yang J, Lu J, Lee S, Batra D, Parikh D (2018) Graph R-CNN for scene graph generation. In: Proceedings of the European conference on computer vision (ECCV), pp 670–685

  62. Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist 2:67–78

  63. Zhang J, Kalantidis Y, Rohrbach M, Paluri M, Elgammal A, Elhoseiny M (2019) Large-scale visual relationship understanding. Proc AAAI Conf Artif Intell 33:9185–9194

Author information

Corresponding author

Correspondence to Sebastien Delecraz.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work of Leonor Becerra-Bonache was performed during her teaching leave granted by the CNRS (French National Center for Scientific Research) at the Laboratoire d’Informatique et Systèmes of Aix-Marseille University.

About this article

Cite this article

Delecraz, S., Becerra-Bonache, L., Favre, B. et al. Multimodal Machine Learning for Natural Language Processing: Disambiguating Prepositional Phrase Attachments with Images. Neural Process Lett 53, 3095–3121 (2021). https://doi.org/10.1007/s11063-020-10314-8

  • DOI: https://doi.org/10.1007/s11063-020-10314-8
