Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base
Knowledge-Based Systems ( IF 8.8 ) Pub Date : 2020-01-09 , DOI: 10.1016/j.knosys.2019.105436
Pengfei Li , Kezhi Mao , Yuecong Xu , Qi Li , Jiaheng Zhang

Text representation, a crucial step in text mining and natural language processing, concerns the transformation of unstructured textual data into structured numerical vectors that support various machine learning and data mining algorithms. For document classification, one classical and commonly adopted text representation method is the Bag-of-Words (BoW) model. BoW represents a document as a fixed-length vector of terms, where each term dimension holds a numerical value such as a term frequency or tf-idf weight. However, BoW looks only at the surface form of words. It ignores the semantic, conceptual and contextual information of texts, and also suffers from high dimensionality and sparsity. To address these issues, we propose a novel document representation scheme called Bag-of-Concepts (BoC), which automatically acquires useful conceptual knowledge from an external knowledge base, then conceptualizes words and phrases in the document into higher-level semantics (i.e. concepts) in a probabilistic manner, and eventually represents a document as a distributed vector in the learned concept space. By utilizing background knowledge from the knowledge base, the BoC representation provides more semantic and conceptual information about texts, as well as better interpretability for human understanding. We also propose the Bag-of-Concept-Clusters (BoCCl) model, which clusters semantically similar concepts together and performs entity sense disambiguation to further improve the BoC representation. In addition, we combine the BoCCl and BoW representations using an attention mechanism to effectively utilize both concept-level and word-level information and achieve optimal performance for document classification.
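To make the contrast between the two representations concrete, the following is a minimal sketch in Python, not the authors' implementation. The abstract does not name the knowledge base or the exact weighting scheme, so everything here is an assumption for illustration: `TOY_KB` is a hypothetical hand-written table of P(concept | term) values, and the concept weight of a document is taken to be the plain sum of those probabilities over its tokens. Concept clustering, sense disambiguation and the attention-based combination are left out.

```python
from collections import Counter

# Hypothetical toy knowledge base mapping a term to P(concept | term).
# The entries and probabilities are made up for illustration only.
TOY_KB = {
    "apple": {"fruit": 0.75, "company": 0.25},
    "banana": {"fruit": 1.0},
    "microsoft": {"company": 1.0},
}


def bag_of_words(tokens, vocabulary):
    """Return a term-frequency vector with one dimension per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def bag_of_concepts(tokens, kb, concepts):
    """Return a concept vector with one dimension per concept.

    Assumption: each token contributes its P(concept | term) mass to the
    matching concept dimension; tokens missing from the knowledge base
    contribute nothing.
    """
    weights = dict.fromkeys(concepts, 0.0)
    for token in tokens:
        for concept, prob in kb.get(token, {}).items():
            if concept in weights:
                weights[concept] += prob
    return [weights[concept] for concept in concepts]


if __name__ == "__main__":
    doc = ["apple", "banana", "apple", "pie"]
    vocabulary = sorted(set(doc))      # one dimension per distinct word
    concepts = ["company", "fruit"]    # one dimension per concept
    print(bag_of_words(doc, vocabulary))
    print(bag_of_concepts(doc, TOY_KB, concepts))
```

In this toy setting the word vector grows with the vocabulary, while the concept vector stays at the size of the concept set, and an ambiguous term such as "apple" spreads its weight across several concepts rather than committing to one.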




Updated: 2020-01-09