TableQA: a Large-Scale Chinese Text-to-SQL Dataset for Table-Aware SQL Generation
arXiv - CS - Databases, Pub Date: 2020-06-10, DOI: arxiv-2006.06434
Ningyuan Sun, Xuefeng Yang, Yunfeng Liu

Parsing natural language into corresponding SQL (NL2SQL) with data-driven approaches such as deep neural networks has attracted much attention in recent years. Existing NL2SQL datasets assume that condition values appear exactly in the natural language question and that every query is answerable given the table. However, these assumptions may fail in practical scenarios, because users may use different expressions for the same content in the table, and may ask for information outside the table when they lack a full picture of its contents. We therefore present TableQA, a large-scale cross-domain natural language to SQL dataset in Chinese, consisting of 64,891 questions and 20,311 unique SQL queries on over 6,000 tables. Unlike existing NL2SQL datasets, TableQA requires models to generalize not only to the SQL skeletons of different questions and table schemas, but also to the various expressions used for condition values. Experimental results show that the state-of-the-art model with 95.1% condition value accuracy on WikiSQL achieves only 46.8% condition value accuracy and 43.0% logic form accuracy on TableQA, indicating that the proposed dataset is challenging and necessary to handle. Two table-aware approaches are proposed to alleviate the problem; the end-to-end approach obtains 51.3% and 47.4% accuracy on the condition value and logic form tasks, an improvement of 4.7% and 3.4% respectively.
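
To make the two failure cases in the abstract concrete, the sketch below contrasts exact-match condition value extraction with a toy table-aware lookup. It is an illustration only and not the authors' method: the function names, the longest-common-substring score, the 0.3 threshold, and the English example data are all assumptions chosen for readability (TableQA itself is in Chinese).

```python
from difflib import SequenceMatcher


def exact_match_value(question, column_values):
    """Return the first cell value that appears verbatim in the question, else None."""
    for value in column_values:
        if value in question:
            return value
    return None


def table_aware_value(question, column_values, threshold=0.3):
    """Toy table-aware lookup (hypothetical heuristic, not from the paper).

    Scores each cell value by the length of its longest common substring with
    the question, divided by the value's length, and returns the best-scoring
    value. Returns None when no value reaches the threshold, which models a
    question that asks about something outside the table.
    """
    q = question.lower()
    best_value, best_score = None, 0.0
    for value in column_values:
        v = value.lower()
        matcher = SequenceMatcher(None, q, v, autojunk=False)
        match = matcher.find_longest_match(0, len(q), 0, len(v))
        score = match.size / len(v) if v else 0.0
        if score > best_score:
            best_value, best_score = value, score
    return best_value if best_score >= threshold else None


# Illustrative column contents (assumed data).
companies = ["Tencent Holdings Ltd", "Alibaba Group", "Baidu Inc"]

# The user abbreviates the cell value, so exact matching finds nothing,
# while the table-aware lookup links the mention to the stored value.
abbreviated = "How many employees does Tencent have?"
print(exact_match_value(abbreviated, companies))
print(table_aware_value(abbreviated, companies))

# The user asks about an entity that is not in the table at all.
outside = "How many employees does Xiaomi have?"
print(table_aware_value(outside, companies))
```

The threshold is the main design choice in this sketch: set too low, unanswerable questions get linked to an unrelated cell; set too high, paraphrased mentions are missed.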


Updated: 2020-06-12