Cross-Language Code Search using Static and Dynamic Analyses


arXiv - CS - Software Engineering. Pub Date: 2021-06-16, DOI: arxiv-2106.09173
George Mathew, Kathryn T. Stolee

As code search permeates most activities in software development, code-to-code search has emerged to support using code as a query and retrieving similar code in the search results. Applications include duplicate code detection for refactoring, patch identification for program repair, and language translation. Existing code-to-code search tools rely on static similarity approaches, such as the comparison of tokens and abstract syntax trees (ASTs), to approximate dynamic behavior, which leads to low precision. Most tools do not support cross-language code-to-code search, and those that do rely on machine learning models that require labeled training data. We present Code-to-Code Search Across Languages (COSAL), a cross-language technique that uses both static and dynamic analyses to identify similar code and does not require a machine learning model. Code snippets are ranked using non-dominated sorting based on code token similarity, structural similarity, and behavioral similarity. We empirically evaluate COSAL on two datasets, one of 43,146 Java and Python files and one of 55,499 Java files, and find that 1) code search based on non-dominated ranking of static and dynamic similarity measures is more effective than search based on single or weighted measures; and 2) COSAL has better precision and recall than state-of-the-art within-language and cross-language code-to-code search tools. We explore the potential for using COSAL on large open-source repositories and discuss scalability to more languages and similarity metrics, providing a gateway for practical, multi-language code-to-code search.
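The ranking step named in the abstract, non-dominated sorting over several similarity measures, can be illustrated with a small sketch. This is not COSAL's implementation: the function names, file names, and scores below are hypothetical, and the code shows only the general Pareto-front idea, assuming each candidate already has three similarity scores where higher means more similar.

```python
from typing import Dict, List, Tuple

# Assumed score layout: (token, structural, behavioral) similarity, higher is better.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if a is at least as good as b on every measure and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(candidates: Dict[str, Scores]) -> List[List[str]]:
    """Group candidates into Pareto fronts; front 0 holds the top-ranked results."""
    remaining = dict(candidates)
    fronts: List[List[str]] = []
    while remaining:
        # A candidate belongs to the current front if nothing left dominates it.
        front = sorted(
            name for name, scores in remaining.items()
            if not any(dominates(other, scores) for other in remaining.values())
        )
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


# Made-up similarity scores for four hypothetical search results.
example = {
    "A.java": (0.9, 0.8, 0.7),
    "B.py": (0.6, 0.9, 0.9),
    "C.java": (0.5, 0.7, 0.6),
    "D.py": (0.4, 0.3, 0.2),
}
fronts = non_dominated_sort(example)
print(fronts)
```

Here neither A.java nor B.py beats the other on all three measures, so both land in the first front. A weighted sum would instead force a choice between them that depends on the weights, which is the kind of single or weighted ranking the abstract compares against.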

Updated: 2021-06-18
