CodeQA: A Question Answering Dataset for Source Code Comprehension



arXiv - CS - Computation and Language. Pub Date: 2021-09-17, DOI: arxiv-2109.08365
Chenxiao Liu, Xiaojun Wan

We propose CodeQA, a free-form question answering dataset for source code comprehension: given a code snippet and a question, a textual answer must be generated. CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs. To obtain natural and faithful questions and answers, we implement syntactic rules and semantic analysis to transform code comments into question-answer pairs. We present the construction process and conduct a systematic analysis of our dataset. Experimental results achieved by several neural baselines on our dataset are shown and discussed. While research on question answering and machine reading comprehension is developing rapidly, little prior work has drawn attention to code question answering. This new dataset can serve as a useful research benchmark for source code comprehension.
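To make the idea of turning code comments into question-answer pairs more concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the abstract does not specify the actual syntactic rules or the semantic analysis, so the single regular-expression rule, the `QAPair` record, and the `comment_to_qa` function below are all hypothetical names chosen purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAPair:
    """Hypothetical record: a code snippet plus one question and its answer."""
    code: str
    question: str
    answer: str


# Hypothetical rule (an assumption, not taken from the paper): a comment of
# the form "Return(s) X." is turned into a "what is returned" question.
_RETURN_RULE = re.compile(r"^\s*returns?\s+(?P<what>.+?)\.?\s*$", re.IGNORECASE)


def comment_to_qa(code: str, comment: str) -> Optional[QAPair]:
    """Apply the toy rule; return None when the comment does not match it."""
    match = _RETURN_RULE.match(comment)
    if match is None:
        return None
    return QAPair(
        code=code,
        question="What does the function return?",
        answer=match.group("what"),
    )


snippet = "def size(cart):\n    return len(cart.items)"
pair = comment_to_qa(snippet, "Returns the number of items in the cart.")
```

In this sketch, a comment that does not match the rule yields `None`, so it produces no question-answer pair. A realistic system would presumably need many such rules plus some form of semantic filtering, as the abstract suggests.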


Updated: 2021-09-20
