TreeBERT: A Tree-Based Pre-Trained Model for Programming Language
arXiv - CS - Programming Languages. Pub Date: 2021-05-26, DOI: arXiv:2105.12485
Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, Lei Lyu

Source code can be parsed into an abstract syntax tree (AST) according to the language's syntax rules. However, little pre-training work has considered incorporating this tree structure into the learning process. In this paper, we present TreeBERT, a tree-based pre-trained model for improving programming-language-oriented generation tasks. To utilize the tree structure, TreeBERT represents the AST corresponding to the code as a set of composition paths and introduces a node position embedding. The model is trained with a hybrid objective combining tree masked language modeling (TMLM) and node order prediction (NOP). TMLM uses a novel masking strategy designed around the tree's characteristics to help the model understand the AST and infer its missing semantics. With NOP, TreeBERT extracts syntactic structure by learning the order constraints of nodes in the AST. We pre-trained TreeBERT on datasets covering multiple programming languages. On code summarization and code documentation tasks, TreeBERT outperforms other pre-trained models and state-of-the-art models designed specifically for these tasks. Furthermore, TreeBERT performs well when transferred to programming languages unseen during pre-training.
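To make the "AST as a set of paths" idea concrete, here is a minimal sketch in Python using the standard `ast` module: it parses a code snippet and collects every root-to-terminal path through the tree. This is only an illustration of the general representation; TreeBERT's actual composition-path construction and node position embedding differ in detail from this toy version.

```python
import ast

def ast_paths(source: str):
    """Collect root-to-terminal node-type paths from the AST of `source`.

    Illustrative only: TreeBERT's composition paths are built differently,
    but the core idea of flattening a tree into a set of paths is the same.
    """
    tree = ast.parse(source)

    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            # Terminal node reached: emit the completed path.
            yield prefix
        for child in children:
            yield from walk(child, prefix)

    return list(walk(tree, []))

# Example: paths for a two-argument function.
for path in ast_paths("def add(a, b):\n    return a + b"):
    print(" -> ".join(path))
```

Every path begins at the `Module` root, so a set of such paths jointly covers the whole tree while each individual path stays sequence-shaped, which is what lets a Transformer-style encoder consume tree input.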

Updated: 2021-05-27