Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser, arXiv - CS - Information Retrieval


arXiv - CS - Information Retrieval. Pub Date: 2021-05-01, DOI: arxiv-2105.00150
Yuta Koreeda, Christopher D. Manning

While many NLP papers, tasks and pipelines assume raw, clean text, many texts we encounter in the wild are not so clean, and many of them are visually structured documents (VSDs) such as PDFs. Conventional preprocessing tools for VSDs have mainly focused on word segmentation and coarse layout analysis, while fine-grained logical structure analysis of VSDs (such as identifying paragraph boundaries and their hierarchies) is underexplored. To that end, we proposed to formulate the task as the prediction of transition labels between text fragments, which map the fragments to a tree, and developed a feature-based machine learning system that fuses visual, textual and semantic cues. Our system significantly outperformed baselines in identifying different structures in VSDs. For example, our system obtained a paragraph boundary detection F1 score of 0.951, which is significantly better than a popular PDF-to-text tool with an F1 score of 0.739.
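
To make the formulation above concrete, here is a minimal Python sketch of how a sequence of transition labels between consecutive text fragments could be decoded into a tree. The abstract does not list the label set, so the four labels used here (`continuous`, `consecutive`, `down`, `up`), the `Node` class, and the functions `build_tree` and `outline` are all assumptions made for illustration, not the authors' implementation. In particular, `up` is simplified to climb exactly one level. The labels are given by hand; in the described system they would be predicted by the feature-based model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A block of text in the logical tree (hypothetical structure)."""
    text: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)


def _attach(parent: Node, text: str) -> Node:
    """Create a new node holding `text` as the last child of `parent`."""
    node = Node(text, parent=parent)
    parent.children.append(node)
    return node


def build_tree(fragments: List[str], transitions: List[str]) -> Node:
    """Decode transition labels into a tree.

    transitions[i] describes how fragments[i + 1] relates to fragments[i].
    The label names are assumptions for this sketch:
      continuous  - same block as the previous fragment (merge the text)
      consecutive - new block, sibling of the previous block
      down        - new block, child of the previous block
      up          - new block, sibling of the previous block's parent
    """
    if len(transitions) != max(len(fragments) - 1, 0):
        raise ValueError("need exactly one transition per adjacent fragment pair")
    root = Node("<root>")
    if not fragments:
        return root
    current = _attach(root, fragments[0])
    for frag, label in zip(fragments[1:], transitions):
        if label == "continuous":
            current.text = f"{current.text} {frag}"
        elif label == "consecutive":
            current = _attach(current.parent, frag)
        elif label == "down":
            current = _attach(current, frag)
        elif label == "up":
            parent = current.parent
            # Cannot climb above the root; fall back to a top-level sibling.
            target = parent.parent if parent.parent is not None else parent
            current = _attach(target, frag)
        else:
            raise ValueError(f"unknown transition label: {label!r}")
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree below `node` as indented lines, two spaces per level."""
    lines: List[str] = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(outline(child, depth + 1))
    return lines


fragments = [
    "1. Definitions",
    "(a) Agreement means this",
    "document.",
    "(b) Party means a signatory.",
    "2. Term",
]
transitions = ["down", "continuous", "consecutive", "up"]
tree = build_tree(fragments, transitions)
print("\n".join(outline(tree)))
```

In this toy input, the line-wrapped fragment `document.` is merged into the paragraph it continues, and the two lettered items end up nested under the first numbered heading, which is the kind of paragraph boundary and hierarchy information the abstract refers to.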


Updated: 2021-05-04
