Universal Representation for Code
arXiv - CS - Machine Learning Pub Date : 2021-03-04 , DOI: arxiv-2103.03116
Linfeng Liu, Hoan Nguyen, George Karypis, Srinivasan Sengamedu

Learning from source code usually requires a large amount of labeled data. Beyond the scarcity of labeled data, trained models are highly task-specific and transfer poorly to different tasks. In this work, we present effective pre-training strategies on top of a novel graph-based code representation, to produce universal representations for code. Specifically, our graph-based representation captures important semantics between code elements (e.g., control flow and data flow). We pre-train graph neural networks on this representation to extract universal code properties. The pre-trained model can then be fine-tuned to support various downstream applications. We evaluate our model on two real-world datasets -- spanning over 30M Java methods and 770K Python methods. Through visualization, we reveal discriminative properties in our universal code representation. By comparing against multiple benchmarks, we demonstrate that the proposed framework achieves state-of-the-art results on method name prediction and code graph link prediction.
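The abstract does not spell out how the code graph is built; as a minimal sketch of the general idea (statement-level nodes with separate control-flow and data-flow edge types -- the statement granularity and edge labels here are assumptions, not the paper's actual construction), a toy version over Python source using only the standard `ast` module could look like:

```python
import ast

def build_code_graph(source):
    """Toy code graph for a single function: nodes are top-level
    statements, edges are labeled 'control' (sequential flow) or
    'data:<var>' (a variable stored in one statement is loaded in a
    later one). A minimal sketch only; the paper's representation
    is richer (e.g., branching control flow, finer-grained nodes)."""
    func = ast.parse(source).body[0]
    stmts = func.body
    nodes = [type(s).__name__ for s in stmts]  # e.g. 'Assign', 'Return'
    edges = []
    # Sequential control-flow edges between consecutive statements.
    for i in range(len(stmts) - 1):
        edges.append((i, i + 1, "control"))
    # Data-flow edges: definition site -> later use site, per variable.
    defs = {}  # variable name -> index of most recent defining statement
    for i, stmt in enumerate(stmts):
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in defs:
                    edges.append((defs[node.id], i, f"data:{node.id}"))
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                defs[node.id] = i
    return nodes, edges

src = """
def f(a):
    b = a + 1
    c = b * 2
    return c
"""
nodes, edges = build_code_graph(src)
# Control edges chain the three statements; data edges follow b and c.
```

A graph like this (one node set, typed edges) is the kind of structure a relational GNN can then be pre-trained on, with fine-tuning heads added per downstream task.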

Updated: 2021-03-05