Efficient feature extraction model for validation performance improvement of duplicate bug report detection in software bug triage systems
Introduction
Currently, a significant time- and cost-consuming phase of software engineering is software maintenance, where finding and handling bugs and managing changes are the most critical tasks [1]. Many efforts are directed at automatic bug detection using software testing approaches, such as static and dynamic testing, white- and black-box testing, and other testing strategies [1], [2], [3]. In addition to software testing, bugs reported by end-users should be explored because bug reports reveal bugs that went undetected in the software-testing phase. Furthermore, they improve the user experience of the software and help update it for new end-user requirements. Software bug triage systems like Bugzilla are used to maintain the software, especially for receiving bug reports from end-users, for instance when the software has crashed. The bug reports should be prioritized, categorized [4], and handled by triagers until they can be assigned to developers [5]. There are many challenges in the bug report domain. For example, prioritizing bug reports is particularly important: some reports, such as security bug reports, can directly affect end-user attrition rates and should therefore be identified as soon as possible. There are many efforts toward the automatic identification of security bug reports [6]. Moreover, the severity of each bug report should be predicted so that severe bugs can be triaged as soon as possible [7].
A complex process in bug report handling by triagers is the detection of duplicate bug reports (DBRD); duplicates account for up to 70% of reports in the repositories of bug triage systems, especially for open-source projects with large end-user communities [8]. There are two steps in DBRD, each of which presents a significant challenge. The first challenge is a feature extraction method that finds the most efficient features from bug reports. The second challenge is a duplicate detection method, requiring a predictor or classifier model. There are many studies on these two steps aimed at improving the performance of DBRD. The first challenge is more critical because extracting the most efficient features enables better duplicate detection; for example, apples cannot be distinguished from oranges by length alone, even though color alone suffices. Likewise, volume is redundant when length, width, and height are already available. Therefore, better independent features can be helpful for duplicate detection. This study focuses on finding more useful features, including state-of-the-art ones, to improve the performance of DBRD.
There are many fields in bug reports; thus, many types of features can be extracted. Table 1 describes the common data fields of most software bug triage systems like Bugzilla [9]. Additionally, these systems can be customized by triagers to contain more data fields, and some data fields may have different names for the same concept, e.g., between the Android Issue Tracker and Bugzilla. Importantly, duplicate bug reports form chains: for example, bug report 1 (BR1) can be a duplicate of BR2, where BR2 is called the master bug report of BR1, and BR2 can in turn be a duplicate of BR3. However, there is always one bug report in this chain that has no Merged Id, which is therefore null. Thus, a tree of duplicate bug reports can be built from these duplication chains, and the forest of duplication trees in a bug report repository forms the so-called duplicate buckets. Duplicate bug reports can use the bug id of the master bug report as a master id to identify their corresponding bucket [10].
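The chain-and-bucket structure described above can be sketched in a few lines; the `merged_id` mapping here is a hypothetical simplification of the Merged Id field, not the actual Bugzilla schema.

```python
# Sketch (assumed schema): each bug report may carry a merged id pointing
# at the report it duplicates; the chain ends at a report whose merged id
# is None -- that report is the bucket's master.
def find_master(bug_id, merged_id):
    """Walk the duplication chain from bug_id to its master report."""
    seen = set()
    while merged_id.get(bug_id) is not None:
        if bug_id in seen:            # guard against accidental cycles
            raise ValueError(f"cycle in duplication chain at {bug_id}")
        seen.add(bug_id)
        bug_id = merged_id[bug_id]
    return bug_id

def build_buckets(merged_id):
    """Group every report under the master id of its duplication tree."""
    buckets = {}
    for bug_id in merged_id:
        buckets.setdefault(find_master(bug_id, merged_id), set()).add(bug_id)
    return buckets

# BR1 duplicates BR2, BR2 duplicates BR3; BR3 is the master. BR4 is unique.
chain = {1: 2, 2: 3, 3: None, 4: None}
print(build_buckets(chain))  # {3: {1, 2, 3}, 4: {4}}
```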
The bug repositories of software triage systems are semi-structured. Some data fields are structured, such as identifier, temporal, and categorical fields, whereas others are unstructured, such as text and attached files. Structured data have enumerable attributes with a single value in a specific range, such as tables in databases [11]. The attributes of structured data can be nominal, ordinal, or numeric values. Text, images, graphs, and file fields are unstructured data [12]. High-performance queries in databases cannot use unstructured data.
Further, it is time-consuming to find values in the content of unstructured data, compare them, or perform arithmetic or relational operations on them. Consequently, many similarity measurement metrics (such as the cosine or Manhattan metrics) cannot be applied directly. Unstructured data must be converted to structured data, which can then be used easily by various operators [13]. This process is called feature extraction and is used in many fields of study, including image processing and text mining [14]. There are many techniques in information retrieval for feature extraction based on the content of unstructured data, many of which are described in Section II. This study focuses on a novel feature extraction method that introduces a new model for text fields, which are the most critical in DBRD.
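As a minimal illustration of why this conversion matters, the sketch below turns two free-text summaries into term-frequency vectors so that the cosine and Manhattan metrics become applicable; a real pipeline would add stemming, stop-word removal, and IDF weighting.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for a free-text field."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def manhattan(a, b):
    """Manhattan (L1) distance between two sparse vectors."""
    return sum(abs(a[t] - b[t]) for t in set(a) | set(b))

r1 = tf_vector("crash when opening large file")
r2 = tf_vector("application crash opening file")
print(round(cosine(r1, r2), 3))  # 0.671
```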
Duplication is a binary operator; thus, it is not possible to classify a bug report as unique or duplicate without considering other bug reports. The duplicate detection process needs to compare a suspected duplicate bug report with all bug reports in the repository and check every pair separately. Thus, the duplication label always applies to a pair of bug reports. The opposite of duplication is uniqueness, which in databases resembles the "distinct" operator of the "select" statement in Structured Query Language (SQL). The distinct operator considers the equality of all fields in both records, which is not workable for bug repositories because, in this context, textual fields express the same concept with different wording. Therefore, simple equality is not helpful, and it is essential to consider various types of equality or similarity, known as features. Even after extracting these features, it is difficult to judge duplication; hence, predictive machine learning algorithms such as classifiers can be used to distinguish duplicates from non-duplicates. There have been many efforts in this phase to devise heuristic models, even though classic machine learning algorithms perform well and can detect duplications reliably; whether new models are needed when the established ones already perform well, with evidence of such performance, is also considered in this research.
The primary issues in DBRD are as follows: (I) validation performance and (II) runtime performance. Validation performance refers to improving the accuracy, precision, recall, and related metrics of duplicate detection, which indicate how many bug reports can be correctly classified as duplicates and thus help the triagers. For example, according to a recent report [15], the recall rate was less than 80% for the Eclipse, Firefox, and Open Office bug repositories. A recent study [16] in DBRD shows a recall rate of less than 92% in the best case, when a list of 20 candidate bug reports was sent to triagers; with 10 candidates, the recall rate was below 90%. Runtime performance refers to the time and speed of the duplicate detection process. When using DBRD offline, triagers want to find duplicate bug reports in a repository and group them. After grouping the current bug reports and making the repository clean and ready, the DBRD problem remains important for new bug reports, for which online DBRD should be used. Online DBRD addresses issues (I) and (II) simultaneously [17,18].
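A sketch of how such a recall rate can be computed, assuming the detector returns a ranked list of candidate master reports for each query; all names and numbers here are illustrative.

```python
# A query (a new duplicate report) counts as a hit when its true master
# appears among the top-k candidates shown to the triager.
def recall_rate(ranked_candidates, true_masters, k):
    hits = sum(1 for query, master in true_masters.items()
               if master in ranked_candidates[query][:k])
    return hits / len(true_masters)

# Illustrative ranked candidate lists per query report and ground truth.
ranked = {101: [7, 3, 9, 5], 102: [4, 8, 2, 6], 103: [1, 5, 7, 3]}
truth = {101: 3, 102: 6, 103: 9}
print(round(recall_rate(ranked, truth, 2), 3))  # only query 101 hits in top 2
```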
However, the traditional approaches used in online DBRD need to be reviewed and optimized for (i) feature extraction, (ii) building a model for duplicate detection, and (iii) runtime performance. For example, some features used by DBRD approaches are time-consuming to compute and should be replaced by new, simpler features with low time complexity [15,19,20]. The main goal of this study is to improve the validation performance of DBRD by introducing a new model (Fig. 2) with new features.
Many feature extraction methods have been used to build numerical values from bug reports that describe their similarity. One of these is the BM25F model, which is an overall weighted average of term frequency (TF) and inverse document frequency (IDF) over all common terms in a pair of duplicate-suspicious reports. The central question of this research concerns the BM25F model and TF-IDF features: can aggregate functions of TF and IDF other than their weighted average be more useful and meaningful for DBRD?
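To make this question concrete, the sketch below computes the minimum, maximum, and simple average of the TF-IDF weights of the terms shared by a pair of reports, as alternative aggregates to a weighted average; the toy corpus and the plain TF-IDF weighting are illustrative, not the exact BM25F formulation.

```python
import math

# Toy corpus of report summaries (illustrative).
corpus = [
    "crash when opening large file",
    "application crash on opening large file",
    "ui crash freezes after login",
]
docs = [set(d.split()) for d in corpus]

def idf(term):
    """Inverse document frequency over the toy corpus."""
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_aggregates(text_a, text_b):
    """Min/max/mean of TF-IDF weights of the terms shared by two reports."""
    terms_a = text_a.split()
    shared = set(terms_a) & set(text_b.split())
    weights = [terms_a.count(t) * idf(t) for t in shared]
    if not weights:
        return {"min": 0.0, "max": 0.0, "mean": 0.0}
    return {"min": min(weights), "max": max(weights),
            "mean": sum(weights) / len(weights)}

feats = tfidf_aggregates(corpus[0], corpus[1])
print({k: round(v, 3) for k, v in feats.items()})
```

A ubiquitous term like "crash" receives an IDF of zero, so the minimum, maximum, and mean carry different information about the shared vocabulary of the pair.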
Another question of this research is how to determine which features are most efficient for validation performance. The problem of detecting feature efficiency is similar to the dimension reduction problem in data mining, which has many heuristic and meta-heuristic approaches. This research uses an overall average of previously normalized heuristic metrics as a new hybrid metric to quantify the efficiency of each feature and to pre-validate the newly proposed features before experimental evaluation. Measuring the information efficiency of each feature is necessary to avoid introducing additional features that have not been considered by previous researchers but are not significantly useful for validation performance. Thus, the main contributions of this study are as follows:
- ➢ Proposing a feature extraction model that considers all types of feature extraction techniques
- ➢ Proposing new textual features based on TF-IDF features for validation performance improvement
- ➢ Introducing a new hybrid heuristic metric for validating the efficiency of our proposed and the state-of-the-art features in the DBRD process
- ➢ Improving the performance of the duplicate bug report detector using different types of features with our introduced baseline model to evaluate the proposed features
- ➢ Finding the best machine learning algorithm as a duplicate detector for bug reports
- ➢ Conducting statistical tests to ensure the reproducibility of similar results for un-reviewed bug reports
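As a rough illustration of the hybrid heuristic metric mentioned above, the sketch below min-max normalizes several per-feature heuristic indicators across all candidate features and averages them; the indicator values and feature names are made-up placeholders, not the metrics actually combined in this study.

```python
def minmax(values):
    """Min-max normalize a list of metric values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def hybrid_efficiency(metric_table):
    """metric_table: {feature: [metric1, metric2, ...]} -> {feature: score}."""
    names = list(metric_table)
    columns = list(zip(*(metric_table[f] for f in names)))  # one per metric
    normalized = [minmax(list(col)) for col in columns]
    return {f: sum(col[i] for col in normalized) / len(normalized)
            for i, f in enumerate(names)}

metrics = {            # per-feature heuristic indicator values (made up)
    "tfidf_mean": [0.30, 0.80],
    "tfidf_max":  [0.20, 0.60],
    "report_len": [0.05, 0.10],
}
print(hybrid_efficiency(metrics))
```

Normalizing each indicator before averaging keeps a metric with a large numeric range from dominating the combined score.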
This paper is organized as follows: Section II reviews the state-of-the-art in feature extraction from bug reports and the techniques for detecting useful features. Section III describes the research methodology of DBRD. Section IV demonstrates our proposed features and uses our proposed method of detecting efficient features to assess their quality. Section V evaluates the proposed methods, and Section VI concludes the paper and envisions future work.
Literature review
As mentioned in the introduction, bug reports form a semi-structured software repository. It is challenging to compare unstructured data, such as text, images, and sounds. A common method is to convert unstructured to structured data using feature extraction, where every feature represents a property of the primary data. Multiple varied features can describe primary data more effectively. For example, histograms, representing the frequency of each color in the image, are useful in comparing two
Feature extraction method
Features can be categorized according to the data fields of bug reports, where each category describes a new aspect of a bug report and is extracted using a unique technique. The main idea behind every feature extraction method is the comparison of a pair of bug reports based on their fields. There are many analysis methods, such as equality comparison or subtracting data fields. Based on a history of their usage, features may be divided into five categories [9]: (1) Textual, (2) Temporal, (3) Structural, (4)
Methods of detecting efficient features
Efficient feature discovery is a common issue in dimension reduction problems and in some metric-based classifiers such as decision tree (DT) or rule-based machine learning algorithms. There are some evident and common techniques for useless feature discovery, such as removing features whose standard deviation is near zero or whose most common value covers nearly 100% of samples. Moreover, some features are correlated, meaning they both describe the same behavior, such as the length and volume of
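The two screening heuristics above can be sketched as follows, with illustrative thresholds and made-up feature values.

```python
import math

def pstdev(vals):
    """Population standard deviation."""
    m = sum(vals) / len(vals)
    return math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))

def pearson(x, y):
    """Pearson correlation coefficient between two value lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x)) * \
            math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / denom if denom else 0.0

def screen_features(table, std_eps=1e-6, corr_max=0.95):
    """table: {feature: [values per sample]} -> surviving feature names."""
    # Heuristic 1: drop near-constant features (no signal).
    kept = [f for f, vals in table.items() if pstdev(vals) > std_eps]
    # Heuristic 2: from each highly correlated pair, keep only one.
    survivors = []
    for f in kept:
        if all(abs(pearson(table[f], table[g])) <= corr_max
               for g in survivors):
            survivors.append(f)
    return survivors

table = {
    "constant":   [1.0, 1.0, 1.0, 1.0],   # near-zero deviation: dropped
    "length":     [1.0, 2.0, 3.0, 4.0],
    "volume":     [2.0, 4.0, 6.0, 8.0],   # perfectly correlated: dropped
    "idf_signal": [0.9, 0.1, 0.7, 0.3],
}
print(screen_features(table))  # ['length', 'idf_signal']
```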
Research methodology
Fig. 1 depicts the methodology of DBRD as a data flow diagram (DFD), which was reproduced from many related works [9,15,20,30,36,52,67]. The cylinder, ellipse, and rectangular boxes refer to repositories, intermediate data, and processes, respectively. In this DFD, the dashed line refers to the primary input and output of the methodology. This DFD describes the raw bug reports (box 1) that will be pre-processed (box 2) to make cleaned bug reports (box 3) in the first step, as mentioned in
Proposed Features and Method of Detecting Efficient Features
There are four significant contributions of this study, which were mentioned in the last paragraph of Section I and are further elaborated below:
- ➢ Proposing new textual features based on TF-IDF features for the feature extraction phase (box 6) in the methodology (Fig. 1).
- ➢ Introducing a new hybrid heuristic metric for validating the efficiency of our proposed and the state-of-the-art features in the DBRD process, which can be performed after the feature extraction phase (box 6) in the methodology (
Evaluation and experimental results
The methodology of duplicate detection was depicted in Fig. 1; however, there are some considerations in this process. If there are n bug reports in the database, there are C(n,2) = n × (n − 1)/2 candidate pairs to check, some of which carry a duplicate label while the others are non-duplicate. All duplicate pairs in the database were selected, and between 1.6 and 2.8 times as many non-duplicate pairs were chosen at random for each dataset, as shown in Table 6. It should be
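A sketch of this pair-construction step, assuming a toy repository and an illustrative sampling ratio of 2.0 (the study's actual ratios range from 1.6 to 2.8 per dataset):

```python
import itertools
import random

def build_pairs(bug_ids, duplicate_pairs, ratio, seed=0):
    """Keep every duplicate pair; sample ~ratio times as many negatives."""
    all_pairs = list(itertools.combinations(sorted(bug_ids), 2))  # C(n, 2)
    positives = [p for p in all_pairs if p in duplicate_pairs]
    negatives = [p for p in all_pairs if p not in duplicate_pairs]
    k = min(len(negatives), round(ratio * len(positives)))
    sampled = random.Random(seed).sample(negatives, k)
    return positives, sampled

bugs = range(1, 7)                    # n = 6 -> C(6, 2) = 15 candidate pairs
dups = {(1, 2), (1, 3), (4, 5)}       # labeled duplicate pairs (illustrative)
pos, neg = build_pairs(bugs, dups, ratio=2.0)
print(len(pos), len(neg))  # 3 6
```

Sampling negatives rather than using all C(n,2) − |duplicates| pairs keeps the training set balanced and tractable.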
Conclusion
In this study, we proposed a new feature extraction model, some new features based on the aggregation values (min, max, and simple average) of the TF and IDF, and a heuristic approach to detecting the efficiency of our newly proposed features. Our proposed features have no computational overhead and retain the time complexity of the previous feature extraction process. The experimental results showed that our proposed features, compared with the state-of-the-art ones, achieved, on average, 2%
Credit author statement
The first and second authors participated in the formal analysis and the validation of the proposed method. The third author edited the paper and verified the proposed method. Moreover, the second author, Babamir, is the supervisor of the project whose output is this paper.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
The authors would like to thank the University of Kashan for supporting this study under Grant #23451.
References (86)
- et al., A bug finder refined by a large set of open-source projects, Inf. Softw. Technol. (2019)
- et al., Not all bug reopens are negative: a case study on Eclipse bug reports, Inf. Softw. Technol. (2018)
- et al., A component recommender for bug reports using Discriminative Probability Latent Semantic Analysis, Inf. Softw. Technol. (2016)
- et al., Enhancements for duplication detection in bug reports with manifold correlation features, J. Syst. Softw. (2016)
- et al., Bug report severity level prediction in open source software: a survey and research opportunities, Inf. Softw. Technol. (2019)
- et al., Automating hierarchical document classification for construction management information systems, Autom. Constr. (2003)
- et al., Conceptual-model-based data extraction from multiple-record web pages, Data Knowl. Eng. (1999)
- et al., Automated triaging of very large bug repositories, Inf. Softw. Technol. (2017)
- et al., An HMM-based approach for automatic detection and classification of duplicate bug reports, Inf. Softw. Technol. (2019)
- et al., Bug report triaging using textual, categorical and contextual features using Latent Dirichlet Allocation, International Journal for Innovative Research in Science and Technology (IJIRST) (2015)
A novel probabilistic feature selection method for text classification
Knowl. Based Syst.
Chapter 7 - Concepts of Hypothesis Testing and Types of Errors
Dosage Form Design Parameters
Software Engineering
The bug report duplication problem: an exploratory study
Software Quality Journal
Methods of Feature Extraction for Detecting the Duplicate Bug Reports in Software Triage Systems
A discriminative model approach for accurate duplicate bug report retrieval
Content-based recommendation systems
The Adaptive Web
Ontology-based extraction and structuring of information from data-rich unstructured documents
Preventing duplicate bug reports by continuously querying bug reports
Empirical Software Engineering
Stopping duplicate bug reports before they start with Continuous Querying for bug reports
Automated duplicate bug report classification using subsequence matching
Improving Performance of Automatic Duplicate Bug Reports Detection Using Longest Common Sequence
Duplicate Detection Models for Bug Reports of Software Triage Systems: a Survey
Current Trends In Computer Sciences & Applications
Automatic Duplicate Bug Report Detection using Information Retrieval-based versus Machine Learning-based Approaches
Towards more accurate retrieval of duplicate bug reports
Automatic Typos Detection in Bug Reports
Automatic Interconnected Lexical Typo Correction in Bug Reports of Software Triage Systems
Fast Language-Independent Correction of Interconnected Typos to Finding Longest Terms
New labeled dataset of interconnected lexical typos for automatic correction in the bug reports
SN Applied Sciences
Combining word embedding with information retrieval to recommend similar bug reports
DWEN: deep word embedding network for duplicate bug report detection in software repositories
Detection of duplicate defect reports using natural language processing
Improving the accuracy of duplicate bug report detection using textual similarity measures
Improving bug localization using correlations in crash reports
An approach to detecting duplicate bug reports using natural language and execution information
Proceedings of the 30th International Conference on Software Engineering
Crash graphs: An aggregated view of multiple crashes to improve crash triage
2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN)
A Contextual Approach Towards More Accurate Duplicate Bug Report Detection
Proceedings of the 10th Working Conference on Mining Software Repositories
Duplicate bug report detection with a combination of information retrieval and topic modeling
Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE)
Assisted detection of duplicate bug reports, in: Faculty of Graduate Studies (Computer Science)
The University Of British Columbia
Automated duplicate detection for bug tracking systems
IEEE International Conference on Dependable Systems and Networks (DSN) with FTCS and DCC
Weight similarity measurement model based, object oriented approach for bug databases mining to detect similar and duplicate bugs
Proceedings of the International Conference on Advances in Computing, Communication and Control