Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data,arXiv - CS - Computation and Language


arXiv - CS - Computation and Language, Pub Date: 2021-06-09, DOI: arxiv-2106.05006
Moshe Hazoom, Vibhor Malik, Ben Bogin

Most available semantic parsing datasets, comprising pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluating natural language understanding systems. As a result, they do not contain any of the richness and variety of naturally occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges that have rarely been reflected so far in any other semantic parsing dataset, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between performance on SEDE and performance on other common datasets.
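
The abstract does not spell out how the clause-based metric is computed. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below splits a predicted and a gold query into top-level clauses, scores each clause with a set-based F1 over its comma-separated items, and averages the per-clause scores. The function names (`split_clauses`, `partial_clause_score`), the clause list, and the regex-based splitting are all assumptions made for this example. The splitting ignores nested subqueries and commas inside function calls.

```python
import re

# Clause keywords used to split a query (assumed list; nesting is ignored).
CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]
_PATTERN = re.compile(
    r"\b(" + "|".join(k.replace(" ", r"\s+") for k in CLAUSES) + r")\b",
    re.IGNORECASE,
)


def split_clauses(sql):
    """Map each clause keyword to the set of its normalized, comma-separated items."""
    # re.split with one capturing group yields [prefix, kw1, body1, kw2, body2, ...]
    parts = _PATTERN.split(sql.strip().rstrip(";"))
    clauses = {}
    for keyword, body in zip(parts[1::2], parts[2::2]):
        key = " ".join(keyword.upper().split())
        items = {" ".join(item.lower().split()) for item in body.split(",") if item.strip()}
        clauses.setdefault(key, set()).update(items)
    return clauses


def f1(pred, gold):
    """Set-based F1 between predicted and gold items of one clause."""
    if not pred and not gold:
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def partial_clause_score(pred_sql, gold_sql):
    """Average per-clause F1 over every clause present in either query."""
    pred, gold = split_clauses(pred_sql), split_clauses(gold_sql)
    keys = set(pred) | set(gold)
    if not keys:
        return 0.0
    return sum(f1(pred.get(k, set()), gold.get(k, set())) for k in keys) / len(keys)


if __name__ == "__main__":
    gold = "SELECT id, name FROM users WHERE age > 30"
    pred = "SELECT id FROM users WHERE age > 30"
    print(round(partial_clause_score(pred, gold), 3))
```

Unlike exact-match accuracy, a score of this kind gives partial credit when a prediction gets some clauses right, which is the property the abstract describes as more suitable for real-world queries. A faithful version would compare parsed query trees rather than regex-split strings.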


Updated: 2021-06-10
