Developing an automated mechanism to identify medical articles from Wikipedia for knowledge extraction.
International Journal of Medical Informatics (IF 3.7), Pub Date: 2020-07-13, DOI: 10.1016/j.ijmedinf.2020.104234
Lishan Yu, Sheng Yu

Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of Wikipedia articles that are medical has many benefits: it facilitates medical knowledge extraction, provides a corpus for language modeling, and reduces the data to a manageable size. However, because medical articles make up an extremely small fraction of Wikipedia, the set of articles identified by a generic text classifier would be bloated with irrelevant pages. To control the false discovery rate while maintaining high recall, we developed a mechanism that leverages the rich page elements and the connected nature of Wikipedia and uses a crawling classification strategy to achieve accurate classification. Structured assertional knowledge in the Infoboxes and Wikidata items associated with the identified medical articles was also extracted. This automated mechanism is intended to run periodically so that the results can be updated and shared with the informatics community.
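
The following is a minimal sketch of the two ideas in the abstract, not the authors' implementation. The first function shows why low prevalence inflates the false discovery rate: with purely illustrative numbers (1% prevalence, 95% recall, 95% specificity, none of them taken from the paper), roughly 84% of the flagged pages would be non-medical. The second function illustrates one possible reading of a "crawling classification strategy": start from seed pages and follow links only out of pages that a classifier accepts. The names `crawl_classify`, `get_links`, `is_medical`, and the toy link graph are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, Set


def false_discovery_rate(prevalence: float, recall: float, specificity: float) -> float:
    """Fraction of pages flagged as medical that are actually non-medical."""
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return false_pos / (true_pos + false_pos)


def crawl_classify(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    is_medical: Callable[[str], bool],
) -> Set[str]:
    """Breadth-first crawl that expands only pages the classifier accepts.

    Assumption: `get_links` returns the titles a page links to and
    `is_medical` is any page-level classifier; both are placeholders.
    """
    accepted: Set[str] = set()
    seen: Set[str] = set()
    queue = deque(seeds)
    while queue:
        title = queue.popleft()
        if title in seen:
            continue
        seen.add(title)
        if not is_medical(title):
            continue  # rejected pages are never expanded
        accepted.add(title)
        for linked in get_links(title):
            if linked not in seen:
                queue.append(linked)
    return accepted


# Hypothetical toy link graph and labels, for illustration only.
TOY_LINKS = {
    "Diabetes": ["Insulin", "Sugar"],
    "Insulin": ["Diabetes", "Pancreas"],
    "Pancreas": [],
    "Sugar": ["Candy"],
    "Candy": ["Aspirin"],
    "Aspirin": [],
}
TOY_MEDICAL = {"Diabetes", "Insulin", "Pancreas", "Aspirin"}

found = crawl_classify(
    ["Diabetes"],
    lambda title: TOY_LINKS.get(title, []),
    lambda title: title in TOY_MEDICAL,
)
fdr = false_discovery_rate(prevalence=0.01, recall=0.95, specificity=0.95)
```

In this toy graph the crawl never visits "Candy", so it avoids classifying most irrelevant pages, but it also misses "Aspirin", which is reachable only through rejected pages. A design along these lines therefore trades some recall for a smaller candidate set unless the seed set is broad.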




Updated: 2020-07-18