Adaptive and hybrid context-aware fine-grained word sense disambiguation in topic modeling based document representation
Information Processing & Management (IF 7.4), Pub Date: 2021-04-07, DOI: 10.1016/j.ipm.2021.102592
Wenbo Li , Einoshin Suzuki

We propose a hybrid-context-based topic model with an adaptive context window length for word sense disambiguation in document representation. Document representation is an essential part of various document-based tasks, and word sense disambiguation aims to capture the distinctions between word senses in that representation. Traditional methods mainly rely on knowledge libraries for data enrichment; however, the division of a word's senses may vary across domain-specific datasets. We aim to discover finer-grained semantic differences between words, such as different entities or standpoints, and to handle the disambiguation problem without data enrichment. This disambiguation task poses two challenges: (1) dividing the various senses of each polysemous word, and (2) preserving the differences between synonyms. Most existing models either rely on separate context clusters or integrate an auxiliary module to specify word senses. They can hardly achieve both (1) and (2), since the different senses of a word are assumed to be independent and their intrinsic relationships are ignored. To solve this problem, we introduce the “Bag-of-Senses” (BoS) assumption: a document is a multiset of word senses, and the senses, rather than the words, are generated. The sense of each word occurrence is estimated from both the context in which it occurs and the contexts of the word's other occurrences. In addition, to handle the differing scopes of the sense-related context of each word occurrence, we introduce a variable that adjusts the context window length adaptively. Our experiments on three standard datasets show that our proposal outperforms other state-of-the-art methods in terms of word sense estimation, topic modeling, and document classification.
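
The difference between a bag of words and the “Bag-of-Senses” view described in the abstract can be illustrated with a toy Python sketch. This is not the authors' model: the function names, the example sentence, and the hand-written rule in `toy_sense` are all hypothetical, and the rule stands in for the probabilistic sense estimation that the paper performs. The window length is a plain parameter here, whereas the paper adjusts it adaptively for each occurrence.

```python
from collections import Counter


def context_window(tokens, index, length):
    """Return the tokens within `length` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - length):index]
    right = tokens[index + 1:index + 1 + length]
    return left + right


def toy_sense(tokens, index, length=2):
    """Hypothetical stand-in for sense inference.

    Labels an occurrence of 'bank' as sense 1 when 'river' falls inside its
    context window and as sense 0 otherwise. Every other word gets sense 0.
    """
    if tokens[index] != "bank":
        return 0
    return 1 if "river" in context_window(tokens, index, length) else 0


def bag_of_senses(tokens, sense_of):
    """Count (word, sense) pairs, so a document becomes a multiset of senses."""
    return Counter((tok, sense_of(tokens, i)) for i, tok in enumerate(tokens))


doc = "the river bank flooded so the bank closed early".split()
bow = Counter(doc)                    # bag of words: both 'bank' tokens merge
bos = bag_of_senses(doc, toy_sense)   # bag of senses: the two 'bank' tokens split
```

In this sketch `bow` holds a single entry for `bank`, while `bos` separates the two occurrences into `("bank", 1)` and `("bank", 0)`. Changing the `length` argument changes which context words can influence the label, which is the quantity the abstract proposes to adapt per occurrence.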




Updated: 2021-04-08