Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting
arXiv - CS - Information Retrieval, Pub Date: 2020-07-07, DOI: arxiv-2007.07082
Vladimir Bernstein and Andrei Afanassenkov

Processing large amounts of data is an essential problem of the big data era. Most data exchange is done via direct communication (using APIs) and well-structured file formats (JSON, XML, EDI, etc.), but a significant portion of the data is transferred using arbitrarily formatted computer-generated documents (such as invoices, purchase orders, and financial reports), which require sophisticated processing and human intervention for data interpretation and extraction. The currently available solutions, ranging from manual data entry to low-level scripting and data extraction tools, are costly and require human intervention. This paper describes the principal methodology for unsupervised, fully automatic data extraction from a wide range of computer-generated documents, assuming that their formatting reflects the original structure of the data sources. The presented methodology falls into the category of unsupervised machine learning and consists of three main parts: (1) detecting repeating patterns of text formatting by employing relative feature space clustering and adaptive weighted feature score maps, (2) detecting hierarchical formatting structures via a collapsing and noise-filtering procedure applied to the repeating formatting patterns, and (3) automatically configuring the interactive data extraction tool (SiMX TextConverter) for fully automated processing.
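To make part (1) more concrete, the following is a minimal sketch of what "detecting repeating patterns of text formatting" could look like on single-line records. It is not the paper's method: exact matching on a coarse per-line signature is swapped in for the relative feature space clustering and weighted score maps, which the abstract does not specify. The names `line_signature` and `repeating_patterns` and the sample invoice lines are hypothetical.

```python
import re
from collections import defaultdict

# One token class per alternative: digit runs, ASCII letter runs, whitespace runs.
_TOKEN = re.compile(r"(?P<num>[0-9]+)|(?P<alpha>[A-Za-z]+)|(?P<space>\s+)")


def line_signature(line: str) -> str:
    """Collapse a line into a coarse formatting signature (assumed scheme).

    Digit runs become 'N', letter runs become 'A', whitespace runs become '_';
    every other character (punctuation) is kept unchanged.
    """
    def repl(match: re.Match) -> str:
        if match.lastgroup == "num":
            return "N"
        if match.lastgroup == "alpha":
            return "A"
        return "_"

    return _TOKEN.sub(repl, line.strip())


def repeating_patterns(lines, min_count=2):
    """Group line indices by signature and keep signatures seen min_count+ times.

    Exact-match grouping is a simplified stand-in for clustering: lines with
    the same signature are treated as instances of one formatting pattern.
    """
    groups = defaultdict(list)
    for idx, line in enumerate(lines):
        if line.strip():  # ignore blank lines
            groups[line_signature(line)].append(idx)
    return {sig: idxs for sig, idxs in groups.items() if len(idxs) >= min_count}


# Hypothetical sample document: a header, two detail lines, and a total.
sample = [
    "Invoice 1001",
    "Widget  12  3.50",
    "Gadget 7 10.25",
    "Total: 45.50",
]
patterns = repeating_patterns(sample)
```

In this sketch the two detail lines share one signature and are reported as a repeating pattern, while the header and total lines each occur once and are dropped. A real clustering step would also need to tolerate near matches, such as optional columns or varying word counts, which exact matching cannot.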

Updated: 2020-07-17