Linguistically Informed Masking for Representation Learning in the Patent Domain
arXiv - CS - Computation and Language. Pub Date: 2021-06-10, DOI: arxiv-2106.05768
Sophia Althammer, Mark Buckley, Sebastian Hofstätter, Allan Hanbury

Domain-specific contextualized language models have demonstrated substantial effectiveness gains on domain-specific downstream tasks such as similarity matching, entity recognition, or information retrieval. However, successfully applying such models in highly specialized language domains requires domain adaptation of the pre-trained models. In this paper we propose the empirically motivated Linguistically Informed Masking (LIM) method, which focuses domain-adaptive pre-training on the linguistic patterns of patents, a highly technical sublanguage. We quantify the relevant differences between patent, scientific, and general-purpose language, and demonstrate for two different language models (BERT and SciBERT) that domain adaptation with LIM leads to systematically improved representations, by evaluating the performance of the domain-adapted representations of patent language on two independent downstream tasks: IPC classification and similarity matching. We demonstrate the impact of balancing the learning from different information sources during domain adaptation for the patent domain. We make the source code as well as the domain-adaptive pre-trained patent language models publicly available at https://github.com/sophiaalthammer/patent-lim.
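The core idea sketched in the abstract — biasing the masked-language-model objective toward linguistically salient tokens, and balancing how much is learned from each information source — can be illustrated with a minimal masking routine. This is a hedged sketch, not the authors' implementation: the `salient` token set (standing in for patent-specific noun chunks), the `lim_ratio` balancing parameter, and all probabilities are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def lim_mask(tokens, salient, mask_prob=0.15, lim_ratio=0.8, rng=None):
    """Mask roughly mask_prob of the tokens, drawing lim_ratio of the
    masks from linguistically salient positions (e.g. noun chunks) and
    the remainder uniformly from the other positions.

    `lim_ratio` is the balancing knob: 0.0 recovers uniform BERT-style
    masking, 1.0 masks only the salient tokens."""
    rng = rng or random.Random(0)
    n_masks = max(1, round(mask_prob * len(tokens)))
    salient_pos = [i for i, t in enumerate(tokens) if t in salient]
    other_pos = [i for i in range(len(tokens)) if i not in salient_pos]
    # Take as many salient masks as the ratio asks for (capped by supply).
    n_salient = min(len(salient_pos), round(lim_ratio * n_masks))
    chosen = rng.sample(salient_pos, n_salient)
    chosen += rng.sample(other_pos, min(len(other_pos), n_masks - n_salient))
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    return masked, sorted(chosen)

# Hypothetical patent-style sentence; the salient set mimics a noun-chunk tagger.
tokens = "the semiconductor substrate comprises a dielectric layer".split()
salient = {"semiconductor", "substrate", "dielectric", "layer"}
masked, positions = lim_mask(tokens, salient)
```

With `lim_ratio=0.8`, most masked positions fall on the domain-bearing noun tokens, so the pre-training signal concentrates on exactly the vocabulary where patent language diverges from general-purpose text.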

Updated: 2021-06-11