Leveraging machine learning for software redocumentation—A comprehensive comparison of methods in practice,Software: Practice and Experience



Leveraging machine learning for software redocumentation—A comprehensive comparison of methods in practice
Software: Practice and Experience (IF 3.5), Pub Date: 2020-11-10, DOI: 10.1002/spe.2933
Verena Geist₁, Michael Moser₁, Josef Pichler₂, Rodolfo Santos₃, Volkmar Wieser₃


Source code comments contain key information about the underlying software system. Many redocumentation approaches, however, cannot exploit this valuable source of information. The main reason is that comments differ in their goals and target audience and can therefore be used only selectively for redocumentation. Performing the required classification manually, for example in the form of heuristics, is usually time-consuming and error-prone, and it depends strongly on the programming languages and guidelines of the concrete software systems. By leveraging machine learning (ML), it should be possible to classify comments and thus transfer valuable information from the source code into documentation with less effort but the same quality. We applied classical ML techniques as well as deep learning (DL) approaches to legacy systems, transferring source code comments into meaningful representations using, for example, word embeddings, but also novel approaches based on quick response (QR) codes or a special character-to-image encoding. The results were compared with an industry-strength heuristic classification. We found that ML outperforms the heuristics, producing fewer errors with less effort: we achieve an accuracy of more than 95% for an image-based DL network and even over 96% for a traditional approach using a random forest classifier.
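The abstract mentions a character-to-image encoding but does not describe it. To give a rough idea of what such a representation could look like, here is a minimal sketch under stated assumptions: each character's code point is clamped to one byte and written into a fixed-size grayscale grid, with short comments padded by zeros. The function name `comment_to_image`, the grid dimensions, and the clamping rule are all hypothetical and are not taken from the paper.

```python
from typing import List


def comment_to_image(comment: str, width: int = 8, height: int = 8) -> List[List[int]]:
    """Encode a comment as a height x width grid of grayscale values (0-255).

    Hypothetical encoding for illustration only; the paper's actual
    character-to-image scheme is not specified in the abstract.
    """
    size = width * height
    # Truncate long comments and clamp each code point to a valid byte value.
    pixels = [min(ord(ch), 255) for ch in comment[:size]]
    # Pad short comments with zeros so every image has the same shape.
    pixels += [0] * (size - len(pixels))
    # Slice the flat pixel list into rows of the requested width.
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


if __name__ == "__main__":
    image = comment_to_image("// TODO: fix rounding", width=8, height=4)
    for row in image:
        print(row)
```

A grid of this kind could then be passed to an image-based network in place of the raw text. Whether the authors used a comparable layout is not stated here.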


Updated: 2020-11-10
