The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service,arXiv - CS - Information Retrieval

arXiv - CS - Information Retrieval, Pub Date: 2019-11-22, DOI: arxiv-1911.09969
Meng Chen, Ruixue Liu, Lei Shen, Shaozu Yuan, Jingyan Zhou, Youzheng Wu, Xiaodong He, Bowen Zhou

Human conversations are complicated, and building a human-like dialogue agent is an extremely challenging task. With the rapid development of deep learning techniques, data-driven models, which need a huge amount of real conversation data, have become more and more prevalent. In this paper, we construct a large-scale, real-scenario Chinese E-commerce conversation corpus, JDDC, with more than 1 million multi-turn dialogues, 20 million utterances, and 150 million words. The dataset reflects several characteristics of human-human conversations, e.g., being goal-driven and having long-term dependencies among the context. It also covers various dialogue types, including task-oriented, chitchat, and question-answering. Extra intent information and three well-annotated challenge sets are also provided. We then evaluate several retrieval-based and generative models to provide basic benchmark performance on the JDDC corpus. We hope JDDC can serve as an effective testbed and benefit the development of fundamental research in dialogue tasks.

Updated: 2020-03-25
