Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base
Knowledge-Based Systems (IF 5.101), Pub Date: 2020-01-09, DOI: 10.1016/j.knosys.2019.105436
Pengfei Li; Kezhi Mao; Yuecong Xu; Qi Li; Jiaheng Zhang

Text representation, a crucial step in text mining and natural language processing, concerns transforming unstructured textual data into structured numerical vectors that support various machine learning and data mining algorithms. For document classification, one classical and commonly adopted text representation method is the Bag-of-Words (BoW) model. BoW represents a document as a fixed-length vector of terms, where each term dimension holds a numerical value such as a term frequency or tf-idf weight. However, BoW looks only at the surface form of words. It ignores the semantic, conceptual and contextual information of texts, and also suffers from high dimensionality and sparsity. To address these issues, we propose a novel document representation scheme called Bag-of-Concepts (BoC), which automatically acquires useful conceptual knowledge from an external knowledge base, then conceptualizes words and phrases in the document into higher-level semantics (i.e. concepts) in a probabilistic manner, and finally represents a document as a distributed vector in the learned concept space. By utilizing background knowledge from the knowledge base, the BoC representation provides more semantic and conceptual information about texts, as well as better interpretability for human understanding. We also propose the Bag-of-Concept-Clusters (BoCCl) model, which clusters semantically similar concepts together and performs entity sense disambiguation to further improve the BoC representation. In addition, we combine the BoCCl and BoW representations using an attention mechanism to effectively utilize both concept-level and word-level information and achieve optimal performance for document classification.
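To make the contrast in the abstract concrete, the following is a minimal sketch of a term-frequency BoW vector next to a probabilistic concept vector. It is not the authors' implementation: the toy knowledge base `TOY_KB`, its probabilities, and the function names `bag_of_words` and `bag_of_concepts` are all hypothetical, and the weighting (term frequency times P(concept | term), then normalized) is only one plausible reading of "conceptualizes words in a probabilistic manner". Concept clustering, sense disambiguation, and the attention-based combination are left out.

```python
from collections import Counter

# Hypothetical toy knowledge base: P(concept | term) for a few terms.
# A real probabilistic knowledge base would supply these values.
TOY_KB = {
    "apple": {"fruit": 0.6, "company": 0.4},
    "banana": {"fruit": 1.0},
    "microsoft": {"company": 1.0},
}


def bag_of_words(tokens, vocabulary):
    """Term-frequency vector with one dimension per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def bag_of_concepts(tokens, kb, concepts):
    """Concept vector: each term spreads its frequency over its concepts.

    Assumed weighting: weight(c) = sum over terms of tf(t) * P(c | t),
    normalized so that the weights sum to 1. Terms missing from the
    knowledge base contribute nothing.
    """
    weights = dict.fromkeys(concepts, 0.0)
    for term, freq in Counter(tokens).items():
        for concept, prob in kb.get(term, {}).items():
            if concept in weights:
                weights[concept] += freq * prob
    total = sum(weights.values())
    if total == 0:
        return [0.0] * len(concepts)
    return [weights[c] / total for c in concepts]


tokens = ["apple", "banana", "apple", "microsoft", "the"]
vocabulary = sorted(set(tokens))
concepts = ["company", "fruit"]

bow_vector = bag_of_words(tokens, vocabulary)
boc_vector = bag_of_concepts(tokens, TOY_KB, concepts)
print(dict(zip(vocabulary, bow_vector)))
print(dict(zip(concepts, boc_vector)))
```

In this sketch the BoW vector grows with the vocabulary, while the concept vector has one dimension per concept regardless of how many distinct terms map to it, which is the dimensionality and sparsity argument the abstract makes.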
Updated: 2020-01-09

 
