Unsupervised Translation of Programming Languages, arXiv - CS - Programming Languages


Unsupervised Translation of Programming Languages
arXiv - CS - Programming Languages, Pub Date: 2020-06-05, DOI: arxiv-2006.03511
Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample

A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules applied to the abstract syntax tree of the source code. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open-source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.
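
The abstract mentions unit tests that check whether a translated function is correct. The sketch below illustrates that general idea only and is not the authors' evaluation harness: a candidate translation is accepted if it returns the same outputs as a reference function on a fixed set of inputs. All names here (`reference_gcd`, `candidate_gcd`, `wrong_gcd`, `passes_unit_tests`, `TEST_INPUTS`) are hypothetical and chosen purely for illustration.

```python
from typing import Any, Callable, Sequence, Tuple


def reference_gcd(a: int, b: int) -> int:
    """Hypothetical ground-truth function (iterative Euclid)."""
    while b:
        a, b = b, a % b
    return a


def candidate_gcd(a: int, b: int) -> int:
    """Hypothetical translation: written differently, same behavior."""
    return a if b == 0 else candidate_gcd(b, a % b)


def wrong_gcd(a: int, b: int) -> int:
    """Hypothetical faulty translation: agrees on some inputs only."""
    return abs(a - b)


def passes_unit_tests(
    candidate: Callable[..., Any],
    reference: Callable[..., Any],
    inputs: Sequence[Tuple[Any, ...]],
) -> bool:
    """Return True if candidate matches reference on every input.

    A candidate that raises an exception counts as a failure.
    """
    for args in inputs:
        try:
            got = candidate(*args)
        except Exception:
            return False
        if got != reference(*args):
            return False
    return True


# Assumed test inputs; a real test suite would be larger.
TEST_INPUTS = [(12, 18), (7, 0), (10, 4), (17, 5)]

if __name__ == "__main__":
    print(passes_unit_tests(candidate_gcd, reference_gcd, TEST_INPUTS))
    print(passes_unit_tests(wrong_gcd, reference_gcd, TEST_INPUTS))
```

Note that `wrong_gcd` agrees with the reference on the first two inputs and fails only on the later ones, which is why a check of this kind needs several inputs. Comparing behavior rather than text also means a translation can differ from the reference in style and still be judged correct.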


Updated: 2020-09-23
