Scalable Micro-planned Generation of Discourse from Structured Data, Computational Linguistics


Scalable Micro-planned Generation of Discourse from Structured Data
Computational Linguistics (IF 9.3), Pub Date: 2020-01-01, DOI: 10.1162/coli_a_00363
Anirban Laha, Parag Jain, Abhijit Mishra, Karthik Sankaranarayanan


We present a framework for generating natural language descriptions from structured data such as tables; the problem falls under the category of data-to-text natural language generation (NLG). Modern data-to-text NLG systems typically employ end-to-end statistical and neural architectures that learn from a limited amount of task-specific labeled data, and therefore exhibit limited scalability, domain adaptability, and interpretability. Unlike these systems, ours is a modular, pipeline-based approach that does not require task-specific parallel data. Instead, it relies on monolingual corpora and basic off-the-shelf NLP tools. This makes our system more scalable and easily adaptable to new domains. Our system employs a three-stage pipeline that: (i) converts entries in the structured data to canonical form, (ii) generates simple sentences for each atomic entry in the canonicalized representation, and (iii) combines the sentences to produce a coherent, fluent, and adequate paragraph description through sentence compounding and co-reference replacement modules. Experiments on a benchmark mixed-domain dataset curated for paragraph description from tables reveal the superiority of our system over existing data-to-text approaches. We also demonstrate the robustness of our system in accepting other popular datasets covering diverse data types such as Knowledge Graphs and Key-Value maps.
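The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract says the system relies on monolingual corpora and off-the-shelf NLP tools, whereas the sketch below swaps in a fixed sentence template and simple string matching. The function names (`canonicalize`, `simple_sentence`, `combine`, `describe`), the triple format, and the example row are all hypothetical, chosen only to show how data might flow from a table row to a short description.

```python
from typing import Dict, List, Tuple

# Hypothetical canonical form: (subject, predicate, object)
Triple = Tuple[str, str, str]


def canonicalize(row: Dict[str, str], subject_key: str) -> List[Triple]:
    """Stage (i): turn one table row into atomic triples about a single subject."""
    subject = row[subject_key]
    return [(subject, key, value) for key, value in row.items() if key != subject_key]


def simple_sentence(triple: Triple) -> str:
    """Stage (ii): one simple sentence per triple (fixed template as a stand-in)."""
    subject, predicate, obj = triple
    return f"{subject}'s {predicate} is {obj}."


def combine(sentences: List[str], subject: str, pronoun: str = "its") -> str:
    """Stage (iii): crude sentence compounding plus co-reference replacement.

    Repeated mentions of the subject after the first clause are replaced
    by a pronoun, and the clauses are joined with "and".
    """
    if not sentences:
        return ""
    prefix = f"{subject}'s "
    clauses = []
    for i, sentence in enumerate(sentences):
        clause = sentence.rstrip(".")
        if i > 0 and clause.startswith(prefix):
            clause = f"{pronoun} " + clause[len(prefix):]
        clauses.append(clause)
    return " and ".join(clauses) + "."


def describe(row: Dict[str, str], subject_key: str) -> str:
    """Run the three toy stages in order on one row."""
    triples = canonicalize(row, subject_key)
    sentences = [simple_sentence(t) for t in triples]
    return combine(sentences, row[subject_key])


if __name__ == "__main__":
    # Made-up example row, for illustration only
    print(describe({"name": "Widget", "color": "red", "weight": "2 kg"}, "name"))
```

Keeping each stage as a separate function mirrors the modular design the abstract argues for: any one stage could, in principle, be replaced (for example, the template by a learned sentence generator) without touching the others.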


Updated: 2020-01-01
