Automated Phrase Mining from Massive Text Corpora
IEEE Transactions on Knowledge and Data Engineering (IF 8.9), Pub Date: 2018-10-01, DOI: 10.1109/tkde.2018.2812203
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, Jiawei Han

As one of the fundamental tasks in text analysis, phrase mining aims to extract quality phrases from a text corpus, and it has various downstream applications, including information extraction and retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers and are therefore likely to perform unsatisfactorily on text corpora from new domains and genres without additional, expensive adaptation. None of the state-of-the-art models, even the data-driven ones, is fully automated, because they require human experts to design rules or label phrases. In this paper, we propose a novel framework for automated phrase mining, $\mathsf{AutoPhrase}$, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, and which benefits from, but does not require, a part-of-speech (POS) tagger. Compared to the state-of-the-art methods, $\mathsf{AutoPhrase}$ shows significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. In addition, $\mathsf{AutoPhrase}$ can be extended to model single-word quality phrases.

Updated: 2018-10-01