PatSeg: A Sequential Patent Segmentation Approach
Big Data Research (IF 3.3), Pub Date: 2020-05-04, DOI: 10.1016/j.bdr.2020.100133
Maryam Habibi, Astrid Rheinlaender, Wolfgang Thielemann, Robert Adams, Peter Fischer, Sylvia Krolkiewicz, David Luis Wiegandt, Ulf Leser

Patents are an important source of information in industry and academia. However, quickly grasping the essence of a given patent is difficult, as patents are typically very long and written in a rather inaccessible style. The essential information, especially the invention itself and its experimental part, is usually contained in the description section. However, in many patents the parts of the description are neither annotated nor easily detectable. Here, we describe PatSeg, our novel method for patent segmentation, which aims at automatically and directly identifying the most important parts of a patent. PatSeg uses a two-step approach: a patent is first segmented into text blocks in an unsupervised fashion, and each identified segment is then classified in a supervised step. In contrast to previous work, PatSeg uses semantic word embeddings in both phases and applies a sequential learning algorithm in the second step. These modifications lead to an average improvement of 9.47% (8.78%, 9.00%) in F1-score (precision, recall) and 7.29 in accuracy over a baseline, as evaluated on two novel, manually segmented gold-standard patent corpora. The method is also fast and easily parallelizable, making it applicable to truly large patent collections.
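The abstract does not say which segmentation rule or sequence learner PatSeg uses, so the following is only a minimal sketch of the general two-step idea, not the authors' implementation. It assumes a tiny hand-made embedding table (`EMB`), a cosine-similarity threshold for finding segment boundaries, and Viterbi decoding over label centroids and smoothed transition counts as a stand-in for the sequential learner. All names, vectors, and the threshold are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical 2-d "embeddings"; a real system would load trained word vectors.
EMB = {
    "invention": (1.0, 0.1), "device": (0.9, 0.2), "apparatus": (0.95, 0.15),
    "example": (0.1, 1.0), "experiment": (0.2, 0.9), "measured": (0.15, 0.95),
}

def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def cosine(a, b):
    na, nb = math.hypot(*a), math.hypot(*b)
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def segment(paragraphs, threshold=0.8):
    """Step 1 (unsupervised): open a new segment where adjacent paragraphs diverge."""
    if not paragraphs:
        return []
    segments = [[paragraphs[0]]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            segments.append([cur])
        else:
            segments[-1].append(cur)
    return [" ".join(s) for s in segments]

def train(labeled_docs):
    """Step 2a (supervised): label centroids and add-one smoothed transition log-probs."""
    sums, counts = {}, defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    for doc in labeled_docs:
        prev = None
        for text, label in doc:
            v, s = embed(text), sums.get(label, (0.0, 0.0))
            sums[label] = (s[0] + v[0], s[1] + v[1])
            counts[label] += 1
            if prev is not None:
                trans[prev][label] += 1
            prev = label
    labels = sorted(sums)
    centroids = {l: tuple(c / counts[l] for c in sums[l]) for l in labels}
    log_trans = {}
    for a in labels:
        total = sum(trans[a].values()) + len(labels)
        log_trans[a] = {b: math.log((trans[a][b] + 1) / total) for b in labels}
    return labels, centroids, log_trans

def viterbi(segments, labels, centroids, log_trans):
    """Step 2b: choose the jointly best label sequence rather than labelling each segment alone.
    Adding cosine scores to log-probabilities is a heuristic used only for this sketch."""
    if not segments:
        return []
    emit = [{l: cosine(embed(s), centroids[l]) for l in labels} for s in segments]
    score, back = [dict(emit[0])], []
    for t in range(1, len(segments)):
        score.append({})
        back.append({})
        for l in labels:
            best = max(labels, key=lambda p: score[t - 1][p] + log_trans[p][l])
            score[t][l] = score[t - 1][best] + log_trans[best][l] + emit[t][l]
            back[t - 1][l] = best
    path = [max(labels, key=lambda l: score[-1][l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

paragraphs = ["the invention is a device", "an apparatus device",
              "example experiment measured", "experiment measured"]
training = [[("invention device", "invention"), ("experiment measured", "experimental")]]
segments = segment(paragraphs)
predicted = viterbi(segments, *train(training))
print(list(zip(predicted, segments)))
```

Because each paragraph comparison in step 1 and each document in step 2 is independent of other documents, a pipeline of this shape can be run per patent in parallel, which fits the scalability claim in the abstract.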




Updated: 2020-05-04