A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset
Scientometrics (IF 3.9), Pub Date: 2020-10-17, DOI: 10.1007/s11192-020-03732-x
Cinthia M. Souza, Magali R. G. Meireles, Paulo E. M. Almeida

Patents are an important source of information for measuring the technological advancement of a specific knowledge domain. To facilitate the search for information in patent datasets, classification systems separate documents into groups according to their area of knowledge and assign names that describe their content. As the number of patented inventions grows, these groups need to be subdivided. Since the groups belong to a restricted knowledge domain, naming the generated subcategories can be extremely laborious. This work compares the performance of abstractive and extractive summarization techniques in the task of generating sentences directly associated with the content of patents. The abstractive summarization model was composed of a Seq2Seq architecture and an LSTM network. Training was conducted with a dataset of patent titles and abstracts, and validation was performed using the ROUGE set of metrics. The results obtained by the generated model were compared with the sentences produced by an extractive summarization algorithm applied to the task of naming patent groups. The main idea was to help the specialist name new patent groups created by clustering systems. The naming experiments were performed on a dataset of abstracts of patent documents. Comparative experiments were conducted using four subgroups from the United States Patent and Trademark Office, which uses the Cooperative Patent Classification system.
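
For readers unfamiliar with the ROUGE metrics mentioned above, the following is a minimal sketch of ROUGE-N, the n-gram overlap between a generated sentence and a reference. It is not the authors' evaluation code: the function names `ngrams` and `rouge_n`, the lowercase whitespace tokenization, and the example strings are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall and F1 for two plain-text strings.

    Assumption: tokens are lowercase whitespace-separated words; real
    evaluations often add stemming or other normalization.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical generated label vs. hypothetical reference title.
    generated = "wireless power transfer device"
    title = "device for wireless power transfer"
    print(rouge_n(generated, title, n=1))
    print(rouge_n(generated, title, n=2))
```

Recall measures how much of the reference is covered by the generated sentence, while precision penalizes extra words. This matters when comparing a short generated label against a longer extracted sentence.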

Updated: 2020-10-17