CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis
arXiv - CS - Digital Libraries. Pub Date: 2020-08-28, DOI: arxiv-2008.12828
Ge Zhang, Mike A. Merrill, Yang Liu, Jeffrey Heer, Tim Althoff

Large-scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have not been analyzed in depth, because descriptive labels are absent and generating them requires expert domain knowledge. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. We then evaluate the model on a new classification task: labeling computational notebook cells as stages in the data analysis process, from data import to wrangling, exploration, modeling, and evaluation. We show that our model, leveraging only easily available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest-ever study of scientific code and find that notebook composition correlates with the citation count of corresponding papers.
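
To make the labeling task concrete, here is a minimal sketch of what a heuristic weak labeler for notebook cells might look like. It is not the authors' method: the `STAGE_RULES` keyword sets and the `called_names` and `weak_label` helpers are hypothetical, chosen only to illustrate how function calls found in a cell's abstract syntax tree could be mapped to the five stages named in the abstract.

```python
import ast

# Hypothetical keyword rules for illustration only; the abstract does not
# list the actual expert-supplied heuristics.
STAGE_RULES = {
    "import": {"read_csv", "read_excel", "load"},
    "wrangle": {"dropna", "fillna", "merge", "groupby"},
    "explore": {"plot", "hist", "describe", "scatter"},
    "model": {"fit", "LinearRegression", "RandomForestClassifier"},
    "evaluate": {"score", "accuracy_score", "confusion_matrix"},
}


def called_names(source):
    """Return the names of all functions and methods called in a code cell."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)   # method call such as df.dropna()
            elif isinstance(func, ast.Name):
                names.add(func.id)     # plain call such as print(...)
    return names


def weak_label(source):
    """Return the stage whose rule set matches the most calls, or None."""
    calls = called_names(source)
    best_stage, best_hits = None, 0
    for stage, keywords in STAGE_RULES.items():
        hits = len(calls & keywords)
        if hits > best_hits:
            best_stage, best_hits = stage, hits
    return best_stage


cells = [
    "df = pd.read_csv('data.csv')",
    "df = df.dropna().merge(other, on='id')",
    "df['x'].hist()",
    "clf = RandomForestClassifier().fit(X, y)",
    "print(clf.score(X_test, y_test))",
    "x = 1 + 2",
]
labels = [weak_label(c) for c in cells]
print(labels)
```

Labels produced this way are noisy and leave many cells unlabeled (the last cell above gets `None`), which is the kind of gap a learned model trained on weak supervision is meant to close.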

Updated: 2020-09-01