


An improved text classification modelling approach to identify security messages in heterogeneous projects
Software Quality Journal (IF 1.7), Pub Date: 2021-05-27, DOI: 10.1007/s11219-020-09546-7
Tosin Daniel Oyetoyan, Patrick Morrison

Security remains under-addressed in many organisations, as illustrated by the number of large-scale software security breaches. Preventing breaches can begin during software development if attention is paid to security during the software's design and implementation. One approach to security assurance during software development is to examine communications between developers as a means of studying the security concerns of the project. Prior research has investigated models for classifying project communication messages (e.g., issues or commits) as security related or not. A known problem is that these models are project-specific, limiting their use by other projects or organisations. We investigate whether we can build a generic classification model that can generalise across projects. We define a set of security keywords by extracting them from relevant security sources, dividing them into four categories: asset, attack/threat, control/mitigation, and implicit. Using different combinations of these categories and including them in the training dataset, we built a classification model and evaluated it on industrial, open-source, and research-based datasets containing over 45 different products. Our model based on harvested security keywords as a feature set shows average recall from 55 to 86%, minimum recall from 43 to 71%, and maximum recall from 60 to 100%. It also shows an average f-score between 3.4 and 88%, an average g-measure of at least 66% across all the datasets, and an average AUC of ROC from 69 to 89%. In addition, models that use externally sourced features outperformed models that use project-specific features on average by a margin of 26–44% in recall, 22–50% in g-measure, 0.4–28% in f-score, and 15–19% in AUC of ROC. Further, our results outperform a state-of-the-art prediction model for security bug reports in all cases.
We find, using sound statistical and effect size tests, that (1) using harvested security keywords as features to train a text classification model significantly improves classification models and their generalisation to other projects; (2) including features in the training dataset before model construction significantly improves classification models; and (3) different security categories represent predictors for different projects. Finally, we introduce new and promising approaches to construct models that can generalise across different independent projects.
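
To make the feature idea and the reported metrics concrete, the following is a minimal sketch and not the authors' implementation. The keyword lists, the sample messages, and the "flag if any keyword matches" rule are all hypothetical placeholders; the abstract does not list the harvested keywords or name the classifier. The g-measure is assumed here to be the harmonic mean of recall and (1 − false positive rate), a definition common in the security bug report literature but not stated in the abstract. AUC is omitted because it needs ranked scores rather than hard labels.

```python
import re
from collections import Counter

# Hypothetical keywords for the four categories named in the abstract.
# The real harvested keyword sets are not given in the source.
SECURITY_KEYWORDS = {
    "asset": ["password", "token", "credential"],
    "attack_threat": ["injection", "overflow", "xss"],
    "control_mitigation": ["encrypt", "sanitize", "authenticate"],
    "implicit": ["crash", "leak", "unsafe"],
}
ALL_CATEGORIES = list(SECURITY_KEYWORDS)


def keyword_features(message, categories=ALL_CATEGORIES):
    """Count exact-token keyword hits per selected category in one message."""
    tokens = Counter(re.findall(r"[a-z]+", message.lower()))
    return {
        cat: sum(tokens[kw] for kw in SECURITY_KEYWORDS[cat])
        for cat in categories
    }


def predict(message, categories=ALL_CATEGORIES):
    """Placeholder rule (an assumption): flag a message if any keyword matches."""
    return int(any(keyword_features(message, categories).values()))


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    """Compute recall, precision, f-score, and the assumed g-measure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    false_positive_rate = _ratio(fp, fp + tn)
    specificity = 1.0 - false_positive_rate
    return {
        "recall": recall,
        "precision": precision,
        "f_score": _ratio(2 * precision * recall, precision + recall),
        "g_measure": _ratio(2 * recall * specificity, recall + specificity),
    }


# Made-up messages with made-up labels (1 = security related).
SAMPLE = [
    ("Fix SQL injection in login form", 1),
    ("Update README with build steps", 0),
    ("Encrypt password before storing token", 1),
    ("Refactor logging; fix crash on startup", 0),
    ("Harden session handling against hijacking", 1),
]

labels = [label for _, label in SAMPLE]
predictions = [predict(text) for text, _ in SAMPLE]
metrics = evaluate(labels, predictions)
print(metrics)
```

Passing a subset of category names to `keyword_features` or `predict` (for example `["asset", "attack_threat"]`) mimics the idea of trying different combinations of categories; in a real pipeline the per-category counts would feed a trained classifier instead of the placeholder rule.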


Updated: 2021-05-27
