
Learning from similarity and information extraction from structured documents

  • Special Issue Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

The automation of document processing has recently gained attention owing to its great potential to reduce manual work. Any improvement in information extraction systems or reduction in their error rates benefits companies working with business documents, because lowering reliance on costly and error-prone human work significantly improves revenue. Neural networks have been applied to this area before, but so far they have been trained only on relatively small datasets of hundreds of documents. To successfully explore deep learning techniques and improve information extraction, we compiled a dataset of more than 25,000 documents. We expand on our previous work, in which we showed that convolutions, graph convolutions, and self-attention can work together and exploit all the information within a structured document. Taking the fully trainable method one step further, we now design and examine various approaches to using Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness. The aim is to improve the micro \(F_{1}\) score of per-word classification on this large real-world document dataset. The results verify that trainable access to a similar (yet still different) page, together with its already known target information, improves extraction. The experiments confirm that all the proposed architecture parts (Siamese networks, employing class information, a query-answer attention module, and skip connections to a similar page) are required to beat the previous results. The best model yields an 8.25% gain in the \(F_{1}\) score over the previous state-of-the-art results, and qualitative analysis verifies that the new model performs better for all target classes. Additionally, we report multiple structural observations about the causes of underperformance in some architectures; since none of the techniques used in this work is problem-specific, they can be generalized to other tasks and contexts.
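To make the evaluation metric concrete: micro \(F_{1}\) pools true positives, false positives, and false negatives across all classes before computing precision and recall, so frequent classes weigh more than rare ones. The sketch below is illustrative only and is not the paper's evaluation code; it assumes each word carries a (possibly empty) set of target-class labels.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for per-word multi-label classification.

    y_true, y_pred: one set of class labels per word (sets may be empty).
    Counts are pooled over all words and classes before precision and
    recall are computed, which is what "micro" averaging means.
    """
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with ground truth `[{"total"}, {"date"}]` and predictions `[{"total"}, {"amount"}]` (hypothetical class names), the pooled counts are tp = 1, fp = 1, fn = 1, giving precision = recall = 0.5 and micro \(F_{1}\) = 0.5.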

[Figures 1–9 appear in the full article.]

Data availability

An anonymized version of the dataset is publicly available at [22], together with all the code. The improvement over previous results can be reproduced using the anonymized data without disclosing any sensitive information.

Code availability

The source code is publicly available in a GitHub repository [22].

References

  1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org (2015). https://www.tensorflow.org/

  2. Abbasi, A., Chen, H., Salem, A.: Sentiment analysis in multiple languages: feature selection for opinion classification in web forums. ACM Trans. Inf. Syst 26(3), 1–34 (2008). https://doi.org/10.1145/1361684.1361685

  3. Arsenault, M.O.: Lossless triplet loss (2018). https://towardsdatascience.com/lossless-triplet-loss-7e932f990b24

  4. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)

  5. Burkov, A.: Machine Learning Engineering. True Positive Incorporated (2020)

  6. Cai, Q., Pan, Y., Yao, T., Yan, C., Mei, T.: Memory matching networks for one-shot image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4080–4088 (2018)

  7. Chen, Z., Huang, L., Yang, W., Meng, P., Miao, H.: More than word frequencies: authorship attribution via natural frequency zoned word distribution analysis. CoRR (2012). arXiv:1208.3001

  8. Coüasnon, B., Lemaitre, A.: Recognition of Tables and Forms, pp. 647–677. Springer, London (2014). https://doi.org/10.1007/978-0-85729-859-1_20

  9. Cowie, J., Lehnert, W.: Information extraction. Commun. ACM 39, 80–91 (1996)

  10. Dalvi, B.B., Cohen, W.W., Callan, J.: Websets: extracting sets of entities from the web using unsupervised information extraction. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 243–252 (2012)

  11. d’Andecy, V.P., Hartmann, E., Rusinol, M.: Field extraction by hybrid incremental and a-priori structural templates. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 251–256 (2018). https://doi.org/10.1109/DAS.2018.29

  12. Dhakal, P., Munikar, M., Dahal, B.: One-shot template matching for automatic document data capture. In: 2019 Artificial Intelligence for Transforming Business and Society (AITB), vol. 1, pp. 1–6 (2019). https://doi.org/10.1109/AITB48515.2019.8947440

  13. Eloff, R., Engelbrecht, H.A., Kamper, H.: Multimodal one-shot learning of speech and images. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8623–8627. IEEE (2019)

  14. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006)

  15. Felix, R., Sasdelli, M., Reid, I., Carneiro, G.: Multi-modal ensemble classification for generalized zero shot learning. arXiv preprint arXiv:1901.04623 (2019)

  16. Galassi, A., Lippi, M., Torroni, P.: Attention in natural language processing. Computation and Language (2020)

  17. Ghosh, S.K., Valveny, E.: R-phoc: segmentation-free word spotting using CNN. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 801–806. IEEE (2017)

  18. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 1449–1453. IEEE (2013)

  19. Grigorescu, S.M.: Generative one-shot learning (GOL): a semi-parametric approach to one-shot learning in autonomous vision. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7127–7134. IEEE (2018)

  20. Hamza, H., Belaïd, Y., Belaïd, A.: Case-based reasoning for invoice analysis and recognition. In: Weber, R.O., Richter, M.M. (eds.) Case-Based Reasoning Research and Development, pp. 404–418. Springer, Berlin (2007)

  21. Holecek, M., Hoskovec, A., Baudis, P., Klinger, P.: Table understanding in structured documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 5, pp. 158–164 (2019). https://doi.org/10.1109/ICDARW.2019.40098

  22. Implementation details for this work, source codes and curated anonymized dataset to reproduce results. https://github.com/Darthholi/similarity-models

  23. Jean-Pierre Tixier, A., Nikolentzos, G., Meladianos, P., Vazirgiannis, M.: Graph Classification with 2D Convolutional Neural Networks. arXiv e-prints arXiv:1708.02218 (2017)

  24. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2. Lille (2015)

  25. Kosala, R., Van den Bussche, J., Bruynooghe, M., Blockeel, H.: Information extraction in structured documents using tree automata induction. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) Principles of Data Mining and Knowledge Discovery, pp. 299–311. Springer, Berlin (2002)

  26. Krieger, F., Drews, P., Funk, B., Wobbe, T.: Information extraction from invoices: a graph neural network approach for datasets with high layout variety (2021)

  27. Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple visual concepts. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 33 (2011)

  28. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: The omniglot challenge: a 3-year progress report. Curr. Opin. Behav. Sci. 29, 97–104 (2019)

  29. Lampinen, A.K., McClelland, J.L.: One-shot and few-shot learning of word embeddings. CoRR (2017). arXiv:1710.10280

  30. Lin, Z., Davis, L.S.: Learning pairwise dissimilarity profiles for appearance recognition in visual surveillance. In: Advances in Visual Computing, pp. 23–34. Springer, Berlin (2008)

  31. Liu, R., Lehman, J., Molino, P., Petroski Such, F., Frank, E., Sergeev, A., Yosinski, J.: An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution. arXiv e-prints arXiv:1807.03247 (2018)

  32. Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal information extraction from visually rich documents. arXiv preprint arXiv:1903.11279 (2019)

  33. Lohani, D., Belaïd, A., Belaïd, Y.: An invoice reading system using a graph convolutional network. In: International Workshop on Robust Reading. Perth, Australia (2018). https://hal.inria.fr/hal-01960846

  34. Manual typing is expensive: The TCO of invoice data capture (part 2). https://rossum.ai/blog/manual-typing-is-expensive-the-tco-of-invoice-data-capture-part-2/

  35. Mehrotra, A., Dukkipati, A.: Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033 (2017)

  36. Meta learning papers. https://github.com/floodsung/Meta-Learning-Papers

  37. Mishra, A., Krishna Reddy, S., Mittal, A., Murthy, H.A.: A generative model for zero shot learning using conditional variational autoencoders. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2188–2196 (2018)

  38. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)

  39. Narasimhan, H., Pan, W., Kar, P., Protopapas, P., Ramaswamy, H.G.: Optimizing the multiclass f-measure via biconcave programming. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1101–1106. IEEE (2016)

  40. Nie, Y.P., Han, Y., Huang, J.M., Jiao, B., Li, A.P.: Attention-based encoder–decoder model for answer selection in question answering. Front. Inf. Technol. Electron. Eng. 18(4), 535–544 (2017)

  41. Niepert, M., Ahmed, M., Kutzkov, K.: Learning Convolutional Neural Networks for Graphs. arXiv e-prints arXiv:1605.05273 (2016)

  42. Palm, R., Laws, F., Winther, O.: Attend, copy, parse: end-to-end information extraction from documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 329–336 (2019)

  43. Paul, A., Krishnan, N.C., Munjal, P.: Semantically aligned bias reducing zero shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7056–7065 (2019)

  44. Peng, H.: A comprehensive overview and survey of recent advances in meta-learning. arXiv:2004.11149 (2020)

  45. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  46. Riba, P., Dutta, A., Goldmann, L., Fornes, A., Ramos, O., Llados, J.: Table detection in invoice documents by graph neural networks, pp. 122–127 (2019). https://doi.org/10.1109/ICDAR.2019.00028

  47. Rossum’s blog post “Extracting invoices using AI” at medium.com. https://medium.com/@bzamecnik/extracting-invoices-using-ai-in-a-few-lines-of-code-96e412df7a7a

  48. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065 (2016)

  49. Smith, D., Lopez, M.: Information extraction for semi-structured documents. In: Proceedings of the Workshop on Management of Semistructured Data (1997)

  50. Tenhunen, M., Penttinen, E.: Assessing the carbon footprint of paper vs. electronic invoicing (2010). https://aisel.aisnet.org/acis2010/95

  51. Thakurdesai, N., Raut, N., Tripathi, A.: Face recognition using one-shot learning. Int. J. Comput. Appl. 182, 35–39 (2018). https://doi.org/10.5120/ijca2018918032

  52. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is All You Need. arXiv e-prints arXiv:1706.03762 (2017)

  53. Vinyals, O., Blundell, C., Lillicrap, T.P., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. CoRR (2016). arXiv:1606.04080

  54. Wang, P., Liu, L., Shen, C., Huang, Z., van den Hengel, A., Tao Shen, H.: Multi-attention network for one shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2721–2729 (2017)

  55. Xu, L., Wang, Y., Li, X., Pan, M.: Recognition of handwritten Chinese characters based on concept learning. IEEE Access 7, 102039–102053 (2019)

  56. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21–29 (2016)

  57. Yim, J., Kim, J., Shin, D.: One-shot item search with multimodal data. arXiv:1811.10969 (2018)

  58. Yin, W.: Meta-learning for few-shot natural language processing: a survey. arXiv:2007.09604 (2020)

Acknowledgements

The Rossum.ai team deserves thanks for providing the data and background that enabled the development and growth of this work.

Funding

This work was supported by the Grant SVV-2020-260583. Partial financial support was received from Rossum and Charles University.

Author information

Contributions

The principal author is responsible for the study concept and design, execution, coding, and research. The rest of the Rossum team is responsible for data acquisition, annotation, and storage and for the creation of a working product and environment that enabled a scientific study of this scope.

Corresponding author

Correspondence to Martin Holeček.

Ethics declarations

Conflict of interest

The author (Martin Holeček) has received financial support from Rossum and from Charles University, where he is currently pursuing a PhD. The author has an employment and/or contractual relationship with Rossum, Medicalc, and AMP Solar Group.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Holeček, M. Learning from similarity and information extraction from structured documents. IJDAR 24, 149–165 (2021). https://doi.org/10.1007/s10032-021-00375-3
