
Learning from similarity and information extraction from structured documents

  • Special Issue Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

The automation of document processing has recently gained attention owing to its great potential to reduce manual work. Any improvement in information extraction systems or reduction in their error rates benefits companies working with business documents, because lowering reliance on costly and error-prone human work significantly improves revenue. Neural networks have been applied to this area before, but so far they have been trained only on relatively small datasets of hundreds of documents. To successfully explore deep learning techniques and improve information extraction, we compiled a dataset of more than 25,000 documents. We expand on our previous work, in which we showed that convolutions, graph convolutions, and self-attention can work together and exploit all the information within a structured document. Taking the fully trainable method one step further, we now design and examine various approaches to using Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness. The aim is to improve the micro \(F_{1}\) score of per-word classification on this large real-world document dataset. The results verify that trainable access to a similar (yet still different) page, together with its already known target information, improves extraction. The experiments confirm that all the proposed architecture parts (Siamese networks, employing class information, a query-answer attention module, and skip connections to a similar page) are required to beat the previous results. The best model yields an 8.25% gain in the \(F_{1}\) score over the previous state-of-the-art results, and qualitative analysis verifies that the new model performs better for all target classes. Additionally, we report multiple structural observations about the causes of underperformance in some architectures; since none of the techniques used in this work is problem-specific, they can be generalized to other tasks and contexts.
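To make the evaluation metric concrete: micro \(F_{1}\) pools true positives, false positives, and false negatives across all classes before computing precision and recall, so frequent classes weigh more than rare ones. The sketch below is illustrative only and is not the paper's evaluation code; it assumes each word carries a (possibly empty) set of target-class labels.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for per-word multi-label classification.

    y_true, y_pred: one set of class labels per word (sets may be empty).
    Counts are pooled over all words and classes before precision and
    recall are computed, which is what "micro" averaging means.
    """
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with ground truth `[{"total"}, {"date"}]` and predictions `[{"total"}, {"amount"}]` (hypothetical class names), the pooled counts are tp = 1, fp = 1, fn = 1, giving precision = recall = 0.5 and micro \(F_{1}\) = 0.5.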

[Figures 1–9 appear in the full article.]

Data availability

An anonymized version of the dataset is publicly available at [22], together with all the code. The improvement over previous results can be reproduced using the anonymized data without disclosing any sensitive information.

Code availability

The source code is publicly available in a GitHub repository [22].

References

  1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org (2015). https://www.tensorflow.org/

  2. Abbasi, A., Chen, H., Salem, A.: Sentiment analysis in multiple languages: feature selection for opinion classification in web forums. ACM Trans. Inf. Syst 26(3), 1–34 (2008). https://doi.org/10.1145/1361684.1361685

  3. Arsenault, M.O.: Lossless triplet loss (2018). https://towardsdatascience.com/lossless-triplet-loss-7e932f990b24

  4. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)

  5. Burkov, A.: Machine Learning Engineering. True Positive Incorporated (2020)

  6. Cai, Q., Pan, Y., Yao, T., Yan, C., Mei, T.: Memory matching networks for one-shot image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4080–4088 (2018)

  7. Chen, Z., Huang, L., Yang, W., Meng, P., Miao, H.: More than word frequencies: authorship attribution via natural frequency zoned word distribution analysis. CoRR (2012). arXiv:1208.3001

  8. Coüasnon, B., Lemaitre, A.: Recognition of Tables and Forms, pp. 647–677. Springer, London (2014). https://doi.org/10.1007/978-0-85729-859-1_20

  9. Cowie, J., Lehnert, W.: Information extraction. Commun. ACM 39, 80–91 (1996)

  10. Dalvi, B.B., Cohen, W.W., Callan, J.: Websets: extracting sets of entities from the web using unsupervised information extraction. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 243–252 (2012)

  11. d’Andecy, V.P., Hartmann, E., Rusinol, M.: Field extraction by hybrid incremental and a-priori structural templates. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 251–256 (2018). https://doi.org/10.1109/DAS.2018.29

  12. Dhakal, P., Munikar, M., Dahal, B.: One-shot template matching for automatic document data capture. In: 2019 Artificial Intelligence for Transforming Business and Society (AITB), vol. 1, pp. 1–6 (2019). https://doi.org/10.1109/AITB48515.2019.8947440

  13. Eloff, R., Engelbrecht, H.A., Kamper, H.: Multimodal one-shot learning of speech and images. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8623–8627. IEEE (2019)

  14. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006)

  15. Felix, R., Sasdelli, M., Reid, I., Carneiro, G.: Multi-modal ensemble classification for generalized zero shot learning. arXiv preprint arXiv:1901.04623 (2019)

  16. Galassi, A., Lippi, M., Torroni, P.: Attention in natural language processing. Computation and Language (2020)

  17. Ghosh, S.K., Valveny, E.: R-phoc: segmentation-free word spotting using CNN. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 801–806. IEEE (2017)

  18. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 1449–1453. IEEE (2013)

  19. Grigorescu, S.M.: Generative one-shot learning (GOL): a semi-parametric approach to one-shot learning in autonomous vision. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7127–7134. IEEE (2018)

  20. Hamza, H., Belaïd, Y., Belaïd, A.: Case-based reasoning for invoice analysis and recognition. In: Weber, R.O., Richter, M.M. (eds.) Case-Based Reasoning Research and Development, pp. 404–418. Springer, Berlin (2007)

  21. Holecek, M., Hoskovec, A., Baudis, P., Klinger, P.: Table understanding in structured documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 5, pp. 158–164 (2019). https://doi.org/10.1109/ICDARW.2019.40098

  22. Implementation details for this work, source codes and curated anonymized dataset to reproduce results. https://github.com/Darthholi/similarity-models

  23. Jean-Pierre Tixier, A., Nikolentzos, G., Meladianos, P., Vazirgiannis, M.: Graph Classification with 2D Convolutional Neural Networks. arXiv e-prints arXiv:1708.02218 (2017)

  24. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2. Lille (2015)

  25. Kosala, R., Van den Bussche, J., Bruynooghe, M., Blockeel, H.: Information extraction in structured documents using tree automata induction. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) Principles of Data Mining and Knowledge Discovery, pp. 299–311. Springer, Berlin (2002)

  26. Krieger, F., Drews, P., Funk, B., Wobbe, T.: Information extraction from invoices: a graph neural network approach for datasets with high layout variety (2021)

  27. Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple visual concepts. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 33 (2011)

  28. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: The omniglot challenge: a 3-year progress report. Curr. Opin. Behav. Sci. 29, 97–104 (2019)

  29. Lampinen, A.K., McClelland, J.L.: One-shot and few-shot learning of word embeddings. CoRR (2017). arXiv:1710.10280

  30. Lin, Z., Davis, L.S.: Learning pairwise dissimilarity profiles for appearance recognition in visual surveillance. In: Advances in Visual Computing, pp. 23–34. Springer, Berlin (2008)

  31. Liu, R., Lehman, J., Molino, P., Petroski Such, F., Frank, E., Sergeev, A., Yosinski, J.: An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution. arXiv e-prints arXiv:1807.03247 (2018)

  32. Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal information extraction from visually rich documents. arXiv preprint arXiv:1903.11279 (2019)

  33. Lohani, D., Belaïd, A., Belaïd, Y.: An invoice reading system using a graph convolutional network. In: International Workshop on Robust Reading. Perth, Australia (2018). https://hal.inria.fr/hal-01960846

  34. Manual typing is expensive: The TCO of invoice data capture (part 2). https://rossum.ai/blog/manual-typing-is-expensive-the-tco-of-invoice-data-capture-part-2/

  35. Mehrotra, A., Dukkipati, A.: Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033 (2017)

  36. Meta learning papers. https://github.com/floodsung/Meta-Learning-Papers

  37. Mishra, A., Krishna Reddy, S., Mittal, A., Murthy, H.A.: A generative model for zero shot learning using conditional variational autoencoders. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2188–2196 (2018)

  38. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)

  39. Narasimhan, H., Pan, W., Kar, P., Protopapas, P., Ramaswamy, H.G.: Optimizing the multiclass f-measure via biconcave programming. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1101–1106. IEEE (2016)

  40. Nie, Y.P., Han, Y., Huang, J.M., Jiao, B., Li, A.P.: Attention-based encoder–decoder model for answer selection in question answering. Front. Inf. Technol. Electron. Eng. 18(4), 535–544 (2017)

  41. Niepert, M., Ahmed, M., Kutzkov, K.: Learning Convolutional Neural Networks for Graphs. arXiv e-prints arXiv:1605.05273 (2016)

  42. Palm, R., Laws, F., Winther, O.: Attend, copy, parse: end-to-end information extraction from documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 329–336 (2019)

  43. Paul, A., Krishnan, N.C., Munjal, P.: Semantically aligned bias reducing zero shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7056–7065 (2019)

  44. Peng, H.: A comprehensive overview and survey of recent advances in meta-learning. arXiv:2004.11149 (2020)

  45. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  46. Riba, P., Dutta, A., Goldmann, L., Fornes, A., Ramos, O., Llados, J.: Table detection in invoice documents by graph neural networks, pp. 122–127 (2019). https://doi.org/10.1109/ICDAR.2019.00028

  47. Rossum’s blog post “Extracting invoices using AI” at medium.com. https://medium.com/@bzamecnik/extracting-invoices-using-ai-in-a-few-lines-of-code-96e412df7a7a

  48. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065 (2016)

  49. Smith, D., Lopez, M.: Information extraction for semi-structured documents. In: Proceedings of the Workshop on Management of Semistructured Data (1997)

  50. Tenhunen, M., Penttinen, E.: Assessing the carbon footprint of paper vs. electronic invoicing (2010). https://aisel.aisnet.org/acis2010/95

  51. Thakurdesai, N., Raut, N., Tripathi, A.: Face recognition using one-shot learning. Int. J. Comput. Appl. 182, 35–39 (2018). https://doi.org/10.5120/ijca2018918032

  52. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is All You Need. arXiv e-prints arXiv:1706.03762 (2017)

  53. Vinyals, O., Blundell, C., Lillicrap, T.P., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. CoRR (2016). arXiv:1606.04080

  54. Wang, P., Liu, L., Shen, C., Huang, Z., van den Hengel, A., Tao Shen, H.: Multi-attention network for one shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2721–2729 (2017)

  55. Xu, L., Wang, Y., Li, X., Pan, M.: Recognition of handwritten Chinese characters based on concept learning. IEEE Access 7, 102039–102053 (2019)

  56. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21–29 (2016)

  57. Yim, J., Kim, J., Shin, D.: One-shot item search with multimodal data. arXiv:1811.10969 (2018)

  58. Yin, W.: Meta-learning for few-shot natural language processing: a survey. arXiv:2007.09604 (2020)

Acknowledgements

The Rossum.ai team deserves thanks for providing the data and background that enabled the development and growth of this work.

Funding

This work was supported by the Grant SVV-2020-260583. Partial financial support was received from Rossum and Charles University.

Author information

Contributions

The principal author is responsible for the study concept and design, execution, coding, and research. The rest of the Rossum team is responsible for data acquisition, annotation, and storage and for the creation of a working product and environment that enabled a scientific study of this scope.

Corresponding author

Correspondence to Martin Holeček.

Ethics declarations

Conflict of interest

The author (Martin Holeček) has received financial support from Rossum and from Charles University, where he is currently pursuing a PhD. The author has an employment and/or contractual relationship with Rossum, Medicalc, and AMP Solar Group.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Holeček, M. Learning from similarity and information extraction from structured documents. IJDAR 24, 149–165 (2021). https://doi.org/10.1007/s10032-021-00375-3
