
Towards syntax-aware token embeddings

Published online by Cambridge University Press:  08 July 2020

Diana Nicoleta Popa*
Affiliation:
Laboratoire d’Informatique de Grenoble, Université Grenoble Alpes, 700 Avenue Centrale, 38401 Saint-Martin-d’Hères, France; Naver Labs Europe, 6 Chemin de Maupertuis, 38240 Meylan, France
Julien Perez
Affiliation:
Naver Labs Europe, 6 Chemin de Maupertuis, 38240 Meylan, France
James Henderson
Affiliation:
Idiap Research Institute, 19 Rue Marconi, 1920 Martigny, Switzerland
Eric Gaussier
Affiliation:
Laboratoire d’Informatique de Grenoble, Université Grenoble Alpes, 700 Avenue Centrale, 38401 Saint-Martin-d’Hères, France
*Corresponding author. E-mail: diana.popa@imag.fr

Abstract

Distributional semantic word representations underpin most modern NLP systems. Their usefulness has been demonstrated across various tasks, particularly as inputs to deep learning models. Beyond that, much work has investigated fine-tuning generic word embeddings to leverage linguistic knowledge from large lexical resources. Some work has investigated context-dependent word token embeddings motivated by word sense disambiguation, using sequential context and large lexical resources. More recently, acknowledging the need for an in-context representation of words, some work has leveraged information derived from language modelling and large amounts of data to induce contextualised representations. In this paper, we investigate Syntax-Aware word Token Embeddings (SATokE) as a way to explicitly encode specific information derived from the linguistic analysis of a sentence in the vectors which are input to a deep learning model. We propose an efficient unsupervised learning algorithm based on tensor factorisation for computing these token embeddings given an arbitrary graph of linguistic structure. Applying this method to syntactic dependency structures, we investigate the usefulness of such token representations as part of deep learning models of text understanding. We encode a sentence either by learning embeddings for its tokens and the relations between them from scratch or by leveraging pre-trained relation embeddings to infer token representations. Given sufficient data, the former is slightly more accurate than the latter, yet both provide more informative token embeddings than standard word representations, even when the word representations have been learned on the same type of context from larger corpora (namely pre-trained dependency-based word embeddings). We evaluate our proposal on a large set of supervised tasks using two major families of deep learning models for sentence understanding. We empirically demonstrate the superiority of these token representations over popular distributional word representations on various sentence and sentence-pair classification tasks.
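The abstract describes computing per-sentence token embeddings by factorising a graph of linguistic structure, here syntactic dependency triples. As a purely illustrative aid, the sketch below shows one minimal way such a per-sentence factorisation could look: a bilinear (RESCAL-style) scoring of (dependent, relation, head) triples optimised with PyTorch. The function name fit_token_embeddings, the bilinear score, the negative-sampling loss and the hyperparameters are assumptions made for illustration; this is not the SATokE algorithm presented in the paper.

import torch
import torch.nn.functional as F

def fit_token_embeddings(n_tokens, edges, relations, dim=16, steps=500, lr=0.1, seed=0):
    """edges: list of (dependent_idx, relation_name, head_idx) from a dependency parse."""
    torch.manual_seed(seed)
    rel_ids = {r: i for i, r in enumerate(relations)}
    tok = torch.randn(n_tokens, dim, requires_grad=True)             # token embeddings
    rel = torch.randn(len(relations), dim, dim, requires_grad=True)  # relation matrices
    opt = torch.optim.Adam([tok, rel], lr=lr)
    pos = torch.tensor([[d, rel_ids[r], h] for d, r, h in edges])
    for _ in range(steps):
        # bilinear score e_d^T W_r e_h for observed (positive) dependency edges
        pos_scores = torch.einsum('bi,bij,bj->b', tok[pos[:, 0]], rel[pos[:, 1]], tok[pos[:, 2]])
        # negative samples: same relations, random token pairs
        neg_d = torch.randint(0, n_tokens, (len(edges),))
        neg_h = torch.randint(0, n_tokens, (len(edges),))
        neg_scores = torch.einsum('bi,bij,bj->b', tok[neg_d], rel[pos[:, 1]], tok[neg_h])
        # logistic loss: push observed edges towards 1, random pairs towards 0
        loss = F.softplus(-pos_scores).mean() + F.softplus(neg_scores).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tok.detach(), rel.detach()

# Example sentence "dogs chase cats" with tokens [dogs, chase, cats]:
# dogs --nsubj--> chase, cats --dobj--> chase
edges = [(0, 'nsubj', 1), (2, 'dobj', 1)]
tok_emb, rel_emb = fit_token_embeddings(3, edges, relations=['nsubj', 'dobj'])
print(tok_emb.shape)  # torch.Size([3, 16])

In this sketch the token vectors are recomputed for every sentence, so the same word type receives a different vector depending on its syntactic neighbourhood; reusing fixed, pre-trained relation matrices instead of learning them per sentence corresponds to the second encoding option mentioned in the abstract.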

Type: Article
Copyright: © The Author(s), 2020. Published by Cambridge University Press

