Learning cell embeddings for understanding table layouts
Knowledge and Information Systems (IF 2.5), Pub Date: 2020-09-07, DOI: 10.1007/s10115-020-01508-6
Majid Ghasemi-Gol, Jay Pujara, Pedro Szekely

A large amount of data on the web is in tabular form, such as Excel sheets, CSV files, and web tables. Tabular data is often meant for human consumption and uses data layouts that are difficult for machines to interpret automatically. Previous work uses the stylistic features of tabular cells (such as font size, border type, and background color) to classify cells by their role in the data layout of the document (top attribute, data, metadata, etc.). In this paper, we propose a deep neural network model that embeds semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We pre-train this cell embedding model on a large corpus of tabular documents from various domains. We then propose a classification technique based on recurrent neural networks (RNNs) that combines our pre-trained cell embeddings with the stylistic features introduced in previous work to improve the performance of cell type classification in complex documents. We evaluate our system on three datasets containing documents with various data layouts, in two settings: in-domain and cross-domain training. Our evaluation results show that the proposed cell vector representations, in combination with our RNN-based classification technique, significantly improve cell type classification performance.
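The abstract does not give architectural details, so the following is only a minimal illustrative sketch of the general idea: each cell's embedding is concatenated with its stylistic features, and a simple recurrent network walks along a row of cells to assign each one a role. Everything here is an assumption made for illustration, not the authors' model: the function `classify_row`, the `CELL_TYPES` subset, the hidden size, and the random (untrained) weights are all hypothetical, and a plain Elman-style RNN stands in for whatever recurrent architecture the paper uses.

```python
import math
import random

# Illustrative subset of cell roles named in the abstract (assumption).
CELL_TYPES = ["top_attribute", "data", "metadata"]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these from labeled tables."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def classify_row(embeddings, style_features, hidden_size=8, seed=0):
    """Label each cell of one table row with an untrained Elman-style RNN.

    embeddings:     one pre-trained embedding vector per cell (hypothetical input)
    style_features: one stylistic feature vector per cell (e.g. bold, border flags)
    """
    # Combine each cell's embedding with its stylistic features.
    inputs = [emb + style for emb, style in zip(embeddings, style_features)]
    if not inputs:
        return []

    rng = random.Random(seed)
    w_in = rand_matrix(hidden_size, len(inputs[0]), rng)
    w_rec = rand_matrix(hidden_size, hidden_size, rng)
    w_out = rand_matrix(len(CELL_TYPES), hidden_size, rng)

    hidden = [0.0] * hidden_size
    labels = []
    for x in inputs:
        # The hidden state carries context from the cells already visited.
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        scores = matvec(w_out, hidden)
        labels.append(CELL_TYPES[scores.index(max(scores))])
    return labels


if __name__ == "__main__":
    row_embeddings = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]  # toy 3-d embeddings
    row_styles = [[1.0, 0.0], [0.0, 1.0]]                  # toy style flags
    print(classify_row(row_embeddings, row_styles))
```

Because the weights are random, the predicted labels carry no meaning; the sketch only shows how embeddings and stylistic features could be combined and passed through a recurrent classifier one cell at a time.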




Updated: 2020-09-08