Making Table Understanding Work in Practice
arXiv - CS - Human-Computer Interaction, Pub Date: 2021-09-11, DOI: arxiv-2109.05173
Madelon Hulsebos, Sneha Gathani, James Gale, Isil Dillig, Paul Groth, Çağatay Demiralp

Understanding the semantics of tables at scale is crucial for tasks like data integration, preparation, and search. Table understanding methods aim at detecting a table's topic, semantic column types, column relations, or entities. With the rise of deep learning, powerful models have been developed for these tasks with excellent accuracy on benchmarks. However, we observe that there exists a gap between the performance of these models on these benchmarks and their applicability in practice. In this paper, we address the question: what do we need for these models to work in practice? We discuss three challenges of deploying table understanding models and propose a framework to address them. These challenges include 1) difficulty in customizing models to specific domains, 2) lack of training data for typical database tables often found in enterprises, and 3) lack of confidence in the inferences made by models. We present SigmaTyper which implements this framework for the semantic column type detection task. SigmaTyper encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model. Lastly, we highlight avenues for future research that further close the gap towards making table understanding effective in practice.

Updated: 2021-09-14