Rich-text document styling restoration via reinforcement learning

  • Research Article
Frontiers of Computer Science

Abstract

Richly formatted documents, such as financial disclosures, scientific articles, and government regulations, are widespread on the Web. However, since most of these documents are published only for reading, the styling information inside them is usually missing, which makes them unsuitable or even burdensome to display and edit across different formats and platforms. In this study, we formulate document styling restoration as an optimization problem: identify the styling settings of the document elements (e.g., lines, table cells, text) such that rendering with the output settings yields a document in which each element holds exactly, or very nearly, the same position as in the original document. Since each styling setting is a decision, the problem can be cast as a multi-step decision-making task over all the document elements and then solved by reinforcement learning. Specifically, Monte-Carlo Tree Search (MCTS) is leveraged to explore the different styling settings, and the policy function is learned under the supervision of the delayed rewards. As a case study, we restore the styling information inside tables, where structural and functional data in documents are usually presented. Experiments show that our best reinforcement method successfully restores the styling in 87.65% of the tables, a 25.75% absolute improvement over the greedy method. We also discuss the tradeoff between inference time and restoration success rate, and argue that although the reinforcement methods cannot be used in real-time scenarios, they are suitable for offline tasks with high quality requirements. Finally, this model has been applied in a PDF parser to support cross-format display.
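
To make the decision-making formulation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical one-row table in which the only styling decision per column is a width drawn from `CHOICES`, a toy `render` function that places each column's right edge at the running sum of widths, and a delayed reward of 1 only when every rendered edge matches the original position. It uses plain UCT-style MCTS with random rollouts in place of the learned policy function described in the abstract; all names and values are illustrative assumptions.

```python
import math
import random

# Hypothetical toy setup (not from the paper): each column picks a width from
# CHOICES, and "rendering" puts each column's right edge at the running sum.
CHOICES = [40, 60, 80]
TARGET_EDGES = [60, 100, 180]   # right edges observed in the "original" document
TOLERANCE = 1                   # maximum allowed position error per element

def render(widths):
    """Return the right-edge position of every column for the given widths."""
    edges, x = [], 0
    for w in widths:
        x += w
        edges.append(x)
    return edges

def reward(widths):
    """Delayed reward: 1.0 only if every rendered edge matches its target."""
    edges = render(widths)
    ok = all(abs(e - t) <= TOLERANCE for e, t in zip(edges, TARGET_EDGES))
    return 1.0 if ok else 0.0

class Node:
    def __init__(self, widths, parent=None):
        self.widths = widths      # styling decisions made so far (the state)
        self.parent = parent
        self.children = {}        # action (width) -> Node
        self.visits = 0
        self.value = 0.0          # sum of rewards backed up through this node

    def is_terminal(self):
        return len(self.widths) == len(TARGET_EDGES)

def select_child(node, c=1.4):
    """UCT rule: mean reward plus an exploration bonus."""
    def uct(child):
        mean = child.value / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=uct)

def mcts(iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the current node is fully expanded.
        while not node.is_terminal() and len(node.children) == len(CHOICES):
            node = select_child(node)
        # 2. Expansion: add one untried styling decision.
        if not node.is_terminal():
            action = rng.choice([a for a in CHOICES if a not in node.children])
            child = Node(node.widths + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: complete the remaining decisions at random.
        widths = list(node.widths)
        while len(widths) < len(TARGET_EDGES):
            widths.append(rng.choice(CHOICES))
        r = reward(widths)
        # 4. Backpropagation: the delayed reward updates every ancestor.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Read out the most-visited sequence of decisions.
    node, best = root, []
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        best.append(action)
    return best
```

In this toy instance the only width sequence whose rendered edges match `TARGET_EDGES` is `[60, 40, 80]`, so the reward stays at zero until all three decisions are correct. This is the delayed-reward structure that motivates searching over whole decision sequences rather than committing to each setting independently.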

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2017YFB1002104), the National Natural Science Foundation of China (Grant No. U1811461), and the Innovation Program of Institute of Computing Technology, CAS.

Author information

Corresponding author

Correspondence to Ping Luo.

Additional information

Hongwei Li received the BE degree in software engineering from Fuzhou University, China in 2015 and is now a PhD student at the Institute of Computing Technology, Chinese Academy of Sciences, China. His research interests focus on machine learning, natural language processing, and information extraction.

Yingpeng Hu received the BE degree in computer science and technology from the University of Chinese Academy of Sciences in 2018 and is now an MS student at the Institute of Computing Technology, Chinese Academy of Sciences, China. His research interests focus on machine learning and natural language processing.

Yixuan Cao received the BE degree in transportation engineering from Tongji University, China in 2015 and is now a PhD student at the Institute of Computing Technology, Chinese Academy of Sciences, China. His research interests include natural language processing and information extraction.

Ganbin Zhou received the BE degree in software engineering from Dongbei University, China in 2013. He received his PhD degree from the Institute of Computing Technology, Chinese Academy of Sciences, China in 2018. His research interests mainly focus on dialog systems, machine learning, and data mining.

Ping Luo received the PhD degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, China. He is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences, China. His general area of research is knowledge discovery and machine learning.

About this article

Cite this article

Li, H., Hu, Y., Cao, Y. et al. Rich-text document styling restoration via reinforcement learning. Front. Comput. Sci. 15, 154328 (2021). https://doi.org/10.1007/s11704-020-9322-7

  • DOI: https://doi.org/10.1007/s11704-020-9322-7
