Improving reference mining in patents with BERT
arXiv - CS - Information Retrieval Pub Date : 2021-01-04 , DOI: arxiv-2101.01039
Ken Voskuil, Suzan Verberne

References in patents to scientific literature provide relevant information for studying the relation between science and technological inventions. These references allow us to answer questions about the types of scientific work that lead to inventions. Most prior work analysing the citations between patents and scientific publications focused on the front-page citations, which are well structured and provided in the metadata of patent archives such as Google Patents. In their 2019 paper, Verberne et al. evaluate two sequence labelling methods for extracting references from patents: Conditional Random Fields (CRF) and Flair. In this paper we extend that work by (1) improving the quality of the training data and (2) applying BERT-based models to the problem. We use error analysis throughout our work to find problems in the dataset, improve our models, and reason about the types of errors different models are susceptible to. We first discuss the work by Verberne et al. and other related work in Section 2. We then describe the improvements we make to the dataset and the new models proposed for this task. We compare the results of our new models with previous results, both on the labelled dataset and on a larger unlabelled corpus. We end with a discussion of the characteristics of the results of our new models, followed by a conclusion. Our code and improved dataset are released under an open-source license on GitHub.
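For context, reference extraction framed as sequence labelling with a BERT-based model amounts to token classification over patent text. The sketch below is an illustration only, not the authors' released code: the BIO-style tag set, the `bert-base-cased` checkpoint, and the example sentence are assumptions, and the classification head would first need to be fine-tuned on the labelled patent data before its predictions are meaningful.

```python
# Minimal sketch of BERT-based sequence labelling for reference extraction,
# assuming a BIO tag set over reference spans and the Hugging Face
# `transformers` library. The tag set and model choice are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-REF", "I-REF"]  # hypothetical tags for reference spans

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)  # classification head is randomly initialised; fine-tune before use

text = ("... as described in Smith et al., Nature 521, 436-444 (2015), "
        "the device comprises ...")
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)[0]     # best label per sub-token

# Print each sub-token with its predicted tag
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(f"{token}\t{labels[pred]}")
```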

Updated: 2021-01-05