Learning from similarity and information extraction from structured documents
International Journal on Document Analysis and Recognition (IF 2.3). Pub Date: 2021-06-11. DOI: 10.1007/s10032-021-00375-3
Martin Holeček

The automation of document processing has recently gained attention owing to its great potential to reduce manual work. Any improvement in information extraction systems or reduction in their error rates aids companies working with business documents, because lowering the reliance on costly and error-prone human work significantly improves revenue. Neural networks have been applied to this area before, but so far they have been trained only on relatively small datasets of hundreds of documents. To successfully explore deep learning techniques and improve information extraction, we compiled a dataset of more than 25,000 documents. We expand on our previous work, in which we proved that convolutions, graph convolutions, and self-attention can work together and exploit all the information within a structured document. Taking the fully trainable method one step further, we now design and examine various approaches to using Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness. The aim is to improve the micro \(F_{1}\) score of per-word classification on this huge real-world document dataset. The results verify that trainable access to a similar (yet still different) page, together with its already known target information, improves information extraction. The experiments confirm that all the proposed architecture parts (Siamese networks, the use of class information, a query-answer attention module, and skip connections to a similar page) are required to beat the previous results. The best model yields an 8.25% gain in the \(F_{1}\) score over the previous state-of-the-art results. Qualitative analysis verifies that the new model performs better for all target classes. Additionally, several structural observations about the causes of underperformance in some architectures are presented. Since none of the techniques used in this work are problem-specific, they can be generalized to other tasks and contexts.
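The core idea described above can be illustrated with a toy sketch: a shared ("Siamese") encoder embeds words from both the page being processed and a similar reference page whose target labels are already known, and a simplified query-answer attention step lets each query word inherit the label of its best-matching reference word. This is not the paper's architecture; the encoder, feature dimensions, and class names below are all hypothetical, chosen only to make the mechanism concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(words, W):
    """Shared Siamese encoder: the same weights W embed words from
    both the query page and the similar reference page."""
    return np.tanh(words @ W)

# Hypothetical toy data: 8-dim feature vectors for words on two pages.
W = rng.normal(size=(8, 4))            # shared encoder weights
query_words = rng.normal(size=(5, 8))  # page to be extracted
ref_words = rng.normal(size=(6, 8))    # similar page with known labels
ref_labels = ["amount", "date", "other", "other", "amount", "date"]

q = encode(query_words, W)
r = encode(ref_words, W)

# Simplified query-answer attention: each query word scores all
# reference words and copies the label of its strongest match,
# which is the one-shot "learning from similarity" idea.
scores = q @ r.T                        # (5, 6) similarity matrix
pred = [ref_labels[i] for i in scores.argmax(axis=1)]
print(pred)
```

In the actual model the label transfer is soft (attention-weighted) and trained end to end together with the similarity, rather than a hard argmax over random embeddings as in this sketch.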


