Abstract
We present Lexifield, a fully automatic, language-independent system for building domain-specific lexicons from a short list of terms that define the domain. Lexifield relies on a pre-trained word embedding model, a definition dictionary and a dictionary of synonyms. To evaluate the system, we generated four lexicons: one in French for the topic “son” (“sound”) and three in English for the topics “sound”, “taste” and “odour”. Compared with other word embedding-based systems and with Sensicon, a state-of-the-art sensorial lexicon, our system achieves better precision and recall on reference lists extracted from manually created resources such as Roget’s Thesaurus.
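As a rough illustration of the kind of pipeline the abstract describes, the sketch below expands a seed list by cosine similarity in an embedding space and scores the result against a reference list with precision and recall. It is not the Lexifield algorithm: the toy vectors, the threshold, and the function names `expand` and `precision_recall` are assumptions made for this example, and the dictionary-based steps (definitions and synonyms) are left out.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# A real system would load a pre-trained word embedding model instead.
TOY_EMBEDDINGS = {
    "sound": (0.9, 0.1, 0.0),
    "noise": (0.8, 0.2, 0.1),
    "echo": (0.7, 0.3, 0.0),
    "sweet": (0.1, 0.9, 0.1),
    "bitter": (0.0, 0.8, 0.2),
    "table": (0.1, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand(seeds, embeddings, threshold):
    """Return the seeds plus every word close enough to at least one seed.

    The threshold is a hypothetical parameter chosen for this sketch.
    """
    known_seeds = [s for s in seeds if s in embeddings]
    lexicon = set(seeds)
    for word, vec in embeddings.items():
        if any(cosine(vec, embeddings[s]) >= threshold for s in known_seeds):
            lexicon.add(word)
    return lexicon


def precision_recall(lexicon, reference):
    """Precision and recall of a generated lexicon against a reference list."""
    hits = len(lexicon & reference)
    precision = hits / len(lexicon) if lexicon else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    lexicon = expand({"sound"}, TOY_EMBEDDINGS, threshold=0.9)
    reference = {"sound", "noise", "echo", "hum"}  # made-up reference list
    print(sorted(lexicon), precision_recall(lexicon, reference))
```

With a reference list that contains a word missing from the embedding vocabulary ("hum" here), recall drops while precision is unaffected, which is why both measures are reported together.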
Notes
Because the embedding of a word depends on the context in which it is used, and thus on its POS tag, the experiments distinguish, for instance, taste\(_{noun}\) from taste\(_{verb}\).
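One simple way to keep such POS-specific entries apart is to key the embedding table by word and tag together. The sketch below assumes a `word_pos` key format and invented toy vectors; the note does not say how the actual embedding model encodes the tag.

```python
# Hypothetical key format "word_pos"; the vectors are invented toy values.
TOY_POS_EMBEDDINGS = {
    "taste_noun": (0.2, 0.9),
    "taste_verb": (0.6, 0.5),
}


def pos_key(word, pos):
    """Build the lookup key for a word with a given POS tag."""
    return f"{word}_{pos}"


def lookup(word, pos, embeddings):
    """Return the POS-specific vector, or None if the pair is unknown."""
    return embeddings.get(pos_key(word, pos))


if __name__ == "__main__":
    print(lookup("taste", "noun", TOY_POS_EMBEDDINGS))
    print(lookup("taste", "verb", TOY_POS_EMBEDDINGS))
```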
References
Al-Shalabi R, Kanaan G (2004) Constructing an automatic lexicon for Arabic language. Int J Comput Inf Sci 2(2):114–128
Amsler RA (1981) A taxonomy for English nouns and verbs. In: Proceedings of the 19th annual meeting, Association for Computational Linguistics, pp 133–138
Azad HK, Deepak A (2019) Query expansion techniques for information retrieval: a survey. Inf Process Manag 56(5):1698–1735
Baker CF, Fillmore CJ, Lowe JB (1998) The Berkeley FrameNet project. In: Proceedings of the 17th international conference on computational linguistics, vol 1, Association for Computational Linguistics, pp 86–90
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL pp 31–40
Calzolari N (1984) Detecting patterns in a lexical data base. In: Proceedings of the 10th international conference on computational linguistics, COLING ’84, Association for Computational Linguistics, Stroudsburg, PA, USA, pp 170–173. https://doi.org/10.3115/980431.980527
Chodorow MS, Byrd RJ, Heidorn GE (1985) Extracting semantic hierarchies from a large on-line dictionary. In: Proceedings of the 23rd annual meeting, Association for Computational Linguistics, pp 299–304
Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29
Copestake A (1990) An approach to building the hierarchical element of a lexical knowledge base from a machine readable dictionary. In: First international workshop on inheritance in NLP
Dubois J, Dubois-Charlier F (2010) La combinatoire lexico-syntaxique dans le dictionnaire électronique des mots. Les termes du domaine de la musique à titre d’illustration. Langages 179–180(3):31–56
Dubois J, Dubois-Charlier F (1997) Les Verbes français. Larousse, Paris
Fang H (2008) A re-examination of query expansion using lexical resources. In: Proceedings of ACL-08: HLT, pp 139–147
Fast E, Chen B, Bernstein MS (2016) Empath: understanding topic signals in large-scale text. In: Proceedings of the 2016 CHI conference on human factors in computing systems, ACM, pp 4647–4657
Fellbaum C (1998) WordNet: an electronic lexical database. Bradford Books, Cambridge
Globerson A, Chechik G, Pereira F, Tishby N (2007) Euclidean embedding of co-occurrence data. J Mach Learn Res 8:2265–2295
Jakubíček M, Kilgarriff A, Kovář V, Rychlý P, Suchomel V (2013) The TenTen corpus family. In: 7th International corpus linguistics conference, CL, pp 125–127
Kotov A, Zhai C (2012) Tapping into knowledge base for concept feedback: leveraging ConceptNet to improve search results for difficult queries. In: Proceedings of the fifth ACM international conference on Web search and data mining, ACM, pp 403–412
Kuzi S, Shtok A, Kurland O (2016) Query expansion using word embeddings. In: Proceedings of the 25th ACM international on conference on information and knowledge management, ACM, pp 1929–1932
Lavelli A, Sebastiani F, Zanoli R (2004) Distributional term representations: an experimental comparison. In: Proceedings of the thirteenth ACM international conference on information and knowledge management, pp 615–624
Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Proceedings of the 27th international conference on neural information processing systems, vol. 2, NIPS’14, pp 2177–2185
Liu S, Liu F, Yu C, Meng W (2004) An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 266–272
Manguin JL (2004) Transitivité partielle de la synonymie: application aux dictionnaires de synonymes. Corela—cognition, représentation, langage
Markowitz J, Ahlswede T, Evens M (1986) Semantically significant patterns in dictionary definitions. In: 24th Annual meeting of the association for computational linguistics. http://aclweb.org/anthology/P86-1018
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Mitchell J, Lapata M (2010) Composition in distributional models of semantics. Cognit Sci 34(8):1388–1429
Park D, Kim S, Lee J, Choo J, Diakopoulos N, Elmqvist N (2018) ConceptVector: text visual analytics via interactive lexicon building using word embedding. IEEE Trans Vis Comput Gr 24(1):361–370
Pennebaker JW, Francis ME, Booth RJ (2001) Linguistic inquiry and word count: LIWC 2001, vol 71. Lawrence Erlbaum Associates, Mahwah
Riloff E, Shepherd J (1997) A corpus-based approach for building semantic lexicons. In: Proceedings of the second conference on empirical methods in natural language processing (EMNLP-2), pp 117–124
Riloff E, Shepherd J (1999) A corpus-based bootstrapping algorithm for semi-automated semantic lexicon construction. Nat Lang Eng 5(2):147–156
Roark B, Charniak E (1998) Noun-phrase co-occurrence statistics for semiautomatic semantic lexicon construction. In: Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics, vol 2, Association for Computational Linguistics, pp 1110–1116
Sagot B (2005) Automatic acquisition of a Slovak lexicon from a raw corpus. In: International conference on text, speech and dialogue, Springer, pp 156–163
Tekiroglu SS, Özbal G, Strapparava C (2014) Sensicon: an automatically constructed sensorial lexicon. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1511–1521
Tonelli S, Pighin D (2009) New features for FrameNet: WordNet mapping. In: Proceedings of the thirteenth conference on computational natural language learning, Association for Computational Linguistics, pp 219–227
Verma N, Bhattacharyya P (2004) Automatic lexicon generation through WordNet. GWC 2004:226
Voorhees EM (1994) Query expansion using lexical-semantic relations. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval, ACM Press, pp 61–69
Zhang J, Deng B, Li X (2009) Concept based query expansion using WordNet. In: Proceedings of the 2009 international e-conference on advanced science and technology, IEEE Computer Society, pp 52–55
Zhu M, Wu YFB (2014) Search by multiple examples. In: Proceedings of the 7th ACM international conference on Web search and data mining, ACM Press, pp 667–672
Acknowledgements
This work was partially supported by the SoundCITYve project from Labex IMU.
Cite this article
Mpouli, S., Beigbeder, M. & Largeron, C. Lexifield: a system for the automatic building of lexicons by semantic expansion of short word lists. Knowl Inf Syst 62, 3181–3201 (2020). https://doi.org/10.1007/s10115-020-01451-6
DOI: https://doi.org/10.1007/s10115-020-01451-6