A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments
Journal of Systems and Software (IF 3.5), Pub Date: 2021-01-22, DOI: 10.1016/j.jss.2021.110911
Mehdi Golzadeh, Alexandre Decan, Damien Legay, Tom Mens

Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are classification models to detect and validate bots on the basis of such a dataset. This paper proposes a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct GitHub accounts, of which 527 have been identified as bots. Using this dataset we propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a very high weighted average precision, recall and F1-score of 0.98 on a test set containing 40% of the data. We integrated the classification model into an open source command-line tool to allow practitioners to detect which accounts in a given GitHub repository actually correspond to bots.
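To illustrate how the features named in the abstract could be combined into such a classifier, the following Python sketch computes, for one account, the number of empty and non-empty comments, an approximate number of comment patterns, and the inequality of comments across those patterns, then trains a scikit-learn random forest on labelled accounts. This is not the authors' implementation: the helper names (jaccard_distance, comment_patterns, gini, account_features, train), the greedy Jaccard-based clustering, its 0.5 threshold, and the classifier settings are all assumptions made for illustration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def jaccard_distance(a: str, b: str) -> float:
    """Token-level Jaccard distance between two comments."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def comment_patterns(comments, threshold=0.5):
    """Greedily group comments whose distance to a pattern's first comment
    is at most `threshold` (a hypothetical stand-in for the paper's
    pattern-detection step)."""
    patterns = []  # each pattern is a list of similar comments
    for c in comments:
        for p in patterns:
            if jaccard_distance(c, p[0]) <= threshold:
                p.append(c)
                break
        else:
            patterns.append([c])
    return patterns

def gini(sizes):
    """Gini coefficient of the number of comments per pattern
    (0 = comments evenly spread, close to 1 = concentrated in one pattern)."""
    x = np.sort(np.asarray(sizes, dtype=float))
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def account_features(comments):
    """Feature vector for one account: #empty comments, #non-empty comments,
    #comment patterns, inequality of comments within patterns."""
    empty = [c for c in comments if not c.strip()]
    non_empty = [c for c in comments if c.strip()]
    patterns = comment_patterns(non_empty)
    return [len(empty), len(non_empty), len(patterns),
            gini([len(p) for p in patterns])]

def train(dataset):
    """dataset: iterable of (list_of_comments, label) pairs,
    e.g. label 'bot' or 'human' from a ground-truth dataset."""
    X = [account_features(comments) for comments, _ in dataset]
    y = [label for _, label in dataset]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    return clf

# Example: an account repeating the same message has few patterns and high
# inequality, which makes it look bot-like to this kind of model.
# account_features(["Thanks for the PR!"] * 30 + ["", "LGTM"])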


Updated: 2021-01-28