Transformers-based information extraction with limited data for domain-specific business documents
Engineering Applications of Artificial Intelligence (IF 8) Pub Date: 2020-11-24, DOI: 10.1016/j.engappai.2020.104100
Minh-Tien Nguyen, Dung Tien Le, Linh Le

Information extraction plays an important role in data transformation for business cases. However, building extraction systems in practice faces two challenges: (i) the availability of labeled data is usually limited, and (ii) highly detailed classification is required. This paper introduces a model that addresses both challenges. Unlike prior studies, which usually require a large number of training samples, our extraction model is trained on a small amount of data to extract a large number of information types. To do so, the model exploits the contextual word representations of pre-trained language models trained on huge amounts of general-domain data. To adapt to our downstream task, the model applies transfer learning by stacking Convolutional Neural Networks on top of these representations to learn hidden features for classification. To confirm the efficiency of our method, we apply the model to two actual document-processing cases involving bidding and sale documents from two Japanese companies. Experimental results on real test sets show that, with a small amount of training data, our model achieves high accuracy accepted by our clients.
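The architecture described above — a CNN classification head stacked on top of contextual embeddings from a frozen pre-trained language model — can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the embeddings are random placeholders standing in for a pre-trained encoder's output, and all layer sizes (`seq_len`, `emb_dim`, `n_filters`, `n_classes`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for contextual token embeddings produced by a frozen
# pre-trained encoder (e.g. BERT); shape (seq_len, emb_dim).
# Random values stand in for the real encoder output here.
seq_len, emb_dim = 16, 32
embeddings = rng.standard_normal((seq_len, emb_dim))

def conv1d_relu(x, filters, kernel_size):
    """Slide each filter along the token axis (valid padding), then ReLU.

    x: (seq_len, emb_dim), filters: (n_filters, kernel_size * emb_dim).
    Returns feature maps of shape (seq_len - kernel_size + 1, n_filters).
    """
    n_steps = x.shape[0] - kernel_size + 1
    out = np.empty((n_steps, filters.shape[0]))
    for t in range(n_steps):
        window = x[t:t + kernel_size].ravel()  # flatten the local window
        out[t] = filters @ window              # one score per filter
    return np.maximum(out, 0.0)                # ReLU non-linearity

# Hypothetical head: 8 convolutional filters of width 3, 5 output classes.
n_filters, kernel_size, n_classes = 8, 3, 5
conv_filters = 0.1 * rng.standard_normal((n_filters, kernel_size * emb_dim))
W_cls = 0.1 * rng.standard_normal((n_classes, n_filters))

feature_maps = conv1d_relu(embeddings, conv_filters, kernel_size)
pooled = feature_maps.max(axis=0)              # max-over-time pooling
logits = W_cls @ pooled                        # per-class scores
pred = int(np.argmax(logits))
print(logits.shape, pred)
```

In the transfer-learning setting the abstract describes, only the convolutional filters and classifier weights would be trained on the small labeled set, while the pre-trained encoder supplies the contextual representations.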



Updated: 2020-11-25