ChemTables: a dataset for semantic classification on tables in chemical patents
Journal of Cheminformatics (IF 8.6), Pub Date: 2021-12-11, DOI: 10.1186/s13321-021-00568-2
Zenan Zhai¹, Christian Druckenbrodt², Camilo Thorne², Saber A Akhondi², Dat Quoc Nguyen¹,³, Trevor Cohn¹, Karin Verspoor¹,⁴

Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, tables in patents can present various types of information, including spectroscopic and physical data, or the pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in the content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task: automatically categorising tables in chemical patents based on their contents. Categorising tables by the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. To develop and evaluate methods for the table classification task, we created a new dataset, called ChemTables, which consists of 788 chemical patent tables labelled with their content type. We introduce this dataset in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT, to ChemTables. The best performing model, Table-BERT, achieves a micro-averaged $F_1$ score of 88.66 on the table classification task. The ChemTables dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. The code and models evaluated in this work are in a GitHub repository: https://github.com/zenanz/ChemTables.
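For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a micro-averaged $F_1$ score can be computed for a single-label classification task. It is not taken from the authors' repository: the function name `micro_f1` and the label strings are hypothetical, chosen only for illustration, and the sketch returns a fraction between 0 and 1, so a score reported on a percentage scale would correspond to this value multiplied by 100.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1.

    Illustrative helper (hypothetical name), assuming each table has
    exactly one gold label and one predicted label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            tp += 1      # correct prediction: true positive for class g
        else:
            fp += 1      # false positive for the predicted class p
            fn += 1      # false negative for the gold class g

    if tp == 0:
        return 0.0       # avoids division by zero when nothing is correct

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical content-type labels, NOT the actual ChemTables label set.
gold = ["SPECT", "PHYS", "PHARM", "PHYS", "OTHER"]
pred = ["SPECT", "PHYS", "PHYS", "PHYS", "OTHER"]
score = micro_f1(gold, pred)
```

Under the single-label assumption, every misclassification counts once as a false positive and once as a false negative, so the pooled precision and recall coincide and micro-averaged $F_1$ equals plain accuracy.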

Updated: 2021-12-11
