Efficient feature extraction model for validation performance improvement of duplicate bug report detection in software bug triage systems

https://doi.org/10.1016/j.infsof.2020.106344

Abstract

Context

There are many duplicate bug reports in the semi-structured repositories of software bug triage systems. Detecting them, the duplicate bug report detection (DBRD) process, is a significant problem in such systems.

Objective

The DBRD problem has many issues, such as efficient feature extraction to calculate similarities between bug reports accurately, building a high-performance duplicate detector model, and handling continuous real-time queries. Feature extraction is a technique that converts unstructured data to structured data. The main objective of this study is to improve the validation performance of DBRD using a feature extraction model.

Method

This research focuses on feature extraction to build a new general model containing all types of features. Moreover, it introduces a new feature extractor method that describes a new viewpoint of similarity between texts. The proposed method introduces new textual features based on the aggregation of term frequency and inverse document frequency of the text fields of bug reports in uni-gram and bi-gram forms. Further, a new hybrid measurement metric is proposed for detecting efficient features, which is used to evaluate the efficiency of all features, including the proposed ones.

Results

The validation performance of DBRD was compared for the proposed features and state-of-the-art features. To show the effectiveness of our model, we applied it and other related studies to DBRD of the Android, Eclipse, Mozilla, and Open Office datasets and compared the results. The comparisons showed that our proposed model achieved (i) approximately 2% improvement for accuracy and precision and more than 4.5% and 5.9% improvement for recall and F1-measure, respectively, by applying the linear regression (LR) and decision tree (DT) classifiers and (ii) a performance of 91%–99% (average ~97%) for the four metrics, by applying the DT classifier as the best classifier.

Conclusion

Our proposed features improved the validation performance of DBRD while preserving runtime performance. The pre-processing methods (primarily stemming) improved the validation performance of DBRD only slightly (up to 0.3%), whereas rule-based machine learning algorithms are more useful for the DBRD problem. The results showed that our proposed model is more effective both for the datasets on which state-of-the-art approaches were effective (i.e., Mozilla Firefox) and those on which they were less effective (i.e., Android). The results also showed that combining all types of features can improve the validation performance of DBRD even for the LR classifier, which has lower validation performance but can be implemented easily in software bug triage systems. Without using the longest common subsequence (LCS) feature, which is effective but time-consuming, our proposed features covered the effectiveness of LCS with lower time complexity and runtime overhead. In addition, a statistical analysis shows that the results are reliable and can be generalized to other datasets or similar classifiers.

Introduction

Currently, software maintenance is a significantly time- and cost-consuming phase of software engineering, in which finding and handling bugs and managing changes are the most critical tasks [1]. Many efforts are directed at automatic bug detection using software-testing approaches, such as static and dynamic testing, white- and black-box testing, and other testing strategies [1], [2], [3]. In addition to software testing, bugs reported by end-users should be explored because bug reports reveal bugs that went undetected in the software-testing phase. Furthermore, they improve the user experience of the software and help update it for new end-user requirements. Software bug triage systems such as Bugzilla are used to maintain software, especially to receive bug reports from end-users and to handle crashes. The bug reports should be prioritized, categorized [4], and handled by triagers until they can be assigned to developers [5]. There are many challenges in the bug report domain. For example, prioritizing bug reports is particularly important: some reports, such as security bug reports, can directly affect end-user attrition rates and should therefore be identified as soon as possible. There are many efforts toward the automatic identification of security bug reports [6]. Moreover, the severity of each bug report should be predicted so that it can be triaged as soon as possible [7].

A complex process in bug report handling by triagers is the detection of duplicate bug reports (DBRD); duplicates account for up to 70% of reports in the repositories of bug triage systems, especially for open-source projects with large end-user communities [8]. There are two steps in DBRD, each representing a significant challenge. The first is a feature extraction method that finds the most efficient features from bug reports. The second is a duplicate detection method, requiring a predictor or classifier model. There are many studies on these two steps aimed at improving the performance of DBRD. The first challenge is more critical because extracting the most efficient features enables better duplicate detection; for example, apples cannot be distinguished from oranges by length alone, whereas color alone suffices. Likewise, volume is redundant when length, width, and height are already available. Therefore, better independent features are helpful for duplicate detection. This study focuses on finding more useful features, including state-of-the-art features, to improve the performance of DBRD.

There are many fields in bug reports; thus, many types of features can be extracted. Table 1 describes the common data fields of most software bug triage systems, such as Bugzilla [9]. Additionally, these systems can be customized by triagers to contain more data fields, and some data fields may have different names for the same concept, e.g., between the Android Issue Tracker and Bugzilla. Importantly, duplicate bug reports form chains: bug report 1 (BR1) can be a duplicate of BR2, where BR2 is called the master bug report of BR1, and BR2 can in turn be a duplicate of BR3. However, there is always one bug report in such a chain that has no master and whose Merged Id is therefore null. Thus, a tree of duplicate bug reports can be built from these duplication chains, and a bug report repository contains a forest of such duplication trees, called duplicate buckets. A duplicate bug report can use the bug id of its master bug report as a master id to identify its corresponding bucket [10].
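The chain-following step described above can be sketched as follows; a minimal example assuming a simple mapping from each bug id to its master id (field names are illustrative, not the exact Bugzilla schema):

```python
# Sketch: resolving the duplicate bucket of a bug report by following its
# chain of master ids until a report with no master (Merged Id is None).

def bucket_root(bug_id, merged_id):
    """Follow the duplication chain to the root of the bucket."""
    current = bug_id
    while merged_id.get(current) is not None:
        current = merged_id[current]
    return current

# BR1 -> BR2 -> BR3; BR3 has a null Merged Id, so it is the bucket root.
merged = {"BR1": "BR2", "BR2": "BR3", "BR3": None}
root = bucket_root("BR1", merged)  # "BR3"
```

All reports whose chains end at the same root belong to the same bucket, which is how the forest of duplication trees is partitioned.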

The bug repositories of software triage systems are semi-structured. Some data fields are structured, such as identifier, temporal, and categorical fields, whereas others, such as text and file fields, are unstructured. Structured data have enumerable attributes with a single value in a specific range, such as tables in databases [11]. The attributes of structured data can be nominal, ordinal, or numeric values. Text, images, graphs, and file fields are unstructured data [12]. High-performance queries in databases cannot use unstructured data.

Further, it is time-consuming to find values in the content of unstructured data, compare them, or perform arithmetic or relational operations on them. Therefore, many similarity measurement metrics (such as the cosine or Manhattan equations) also cannot be used. Unstructured data must be converted to structured data, which can be used easily by various operators [13]. This process is called feature extraction and is used in many fields of study, including image processing and text mining [14]. There are many feature extraction techniques in information retrieval based on the content of unstructured data, many of which are described in Section II. This study focuses on a novel feature extraction method that introduces a new model for text fields, which are the most critical in DBRD.
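The point about similarity metrics requiring structured data can be illustrated with a minimal sketch: once text is vectorized into a bag-of-words (a simple stand-in for the richer extraction methods the paper discusses), cosine similarity becomes directly computable:

```python
# Sketch: unstructured text must first be converted to a structured
# vector before a metric such as cosine similarity can be applied.
from collections import Counter
import math

def vectorize(text):
    """Naive bag-of-words vector: term -> frequency."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

s = cosine(vectorize("app crashes on startup"),
           vectorize("application crashes during startup"))  # 0.5
```

Note how "app" and "application" count as different terms here; this is exactly the kind of lexical gap that motivates the pre-processing (stemming, typo correction) discussed elsewhere in the paper.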

Duplication is a binary operator; thus, it is not possible to classify a bug report as unique or duplicate without considering other bug reports. The duplicate detection process must compare a suspicious bug report with all bug reports in the repository and check every pair separately. Thus, the duplication label always applies to a pair of bug reports. The opposite of duplication is uniqueness, which resembles the "distinct" operator of the "select" statement in the structured query language (SQL). The distinct operator considers the equality of all fields in both records, which is not possible in bug repositories because, in this context, textual fields express the same concept with different wording. Therefore, simple equality is not helpful, and it is essential to consider various types of equality or similarity, known as features. Even after extracting these features, it is difficult to judge duplication; hence, predictive machine learning algorithms such as classifiers can be used to distinguish duplicates from non-duplicates. There have been many efforts in this phase to generate heuristic models, even though classic machine learning algorithms perform well and can detect duplications. Whether new models are needed when established ones perform well, with proof of such performance, is also considered in this research.
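The pairwise framing above can be sketched as follows. This is an illustrative example, not the paper's feature set: each pair of reports is mapped to a small feature vector (here, a Jaccard overlap of titles and a component-equality flag, both hypothetical choices) that a trained classifier would then label duplicate or non-duplicate:

```python
# Sketch: duplication is a relation over *pairs*, so features are
# extracted per pair, not per report. Feature choices are illustrative.

def pair_features(report_a, report_b):
    ta, tb = set(report_a["title"].split()), set(report_b["title"].split())
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 0.0
    same_component = int(report_a["component"] == report_b["component"])
    return [jaccard, same_component]

br1 = {"title": "crash on startup", "component": "UI"}
br2 = {"title": "crash during startup", "component": "UI"}
features = pair_features(br1, br2)  # [0.5, 1]
```

A classifier (e.g., the DT or LR models evaluated in the paper) would consume such vectors for every candidate pair.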

The primary issues in DBRD are (I) validation performance and (II) runtime performance. Validation performance refers to improving the accuracy, precision, recall, and similar metrics of duplicate detection, which indicate how many bug reports can be classified as duplicates to help the triagers. For example, according to a recent report [15], the recall rate was less than 80% for the Eclipse, Firefox, and Open Office bug repositories. A newer study [16] on DBRD shows a recall rate below 92% in the best case when a list of 20 candidate bug reports is sent to triagers, which is hardly acceptable; with a list of 10 candidates, the recall rate falls below 90%. Runtime performance refers to the time and speed of the duplicate detection process. When using DBRD offline, triagers want to find duplicate bug reports in a repository and group them. After grouping the current bug reports and making the repository clean and ready, the DBRD problem becomes important for new bug reports, for which online DBRD should be used. Online DBRD addresses issues (I) and (II) simultaneously [17,18].

However, the traditional approaches used in online DBRD need to be reviewed and optimized for (i) feature extraction, (ii) building a model for duplicate detection, and (iii) runtime performance. For example, some features used by existing DBRD approaches are time-consuming and should be replaced by new, simple features with low time complexity [15,19,20]. The main goal of this study is to improve the validation performance of DBRD by introducing a new model (Fig. 2) with new features.

Many feature extraction methods have been used to build numerical values from bug reports that describe their similarity. One of these is the BM25F model, an overall weighted average of term frequency (TF) and inverse document frequency (IDF) over all standard terms in a pair of suspicious duplicate reports. The central question of this research concerns the BM25F model and TF-IDF features: can other aggregate functions of TF and IDF be more useful and meaningful for DBRD than the weighted average of TF and IDF?
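The aggregation idea can be sketched as follows, a minimal uni-gram example assuming a smoothed IDF variant; the paper's actual field weighting, bi-grams, and exact TF/IDF formulas are omitted. Instead of only a weighted average, min, max, and mean of the per-term TF-IDF weights over the shared terms of a pair are computed:

```python
# Sketch: alternative aggregates (min, max, mean) of TF-IDF weights over
# the terms shared by a pair of reports, instead of BM25F's weighted
# average. Corpus and IDF smoothing are illustrative.
import math
from collections import Counter

corpus = [
    "app crashes on startup",
    "screen freezes after login",
    "app crashes after update",
]
docs = [Counter(d.split()) for d in corpus]
N = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d)
    return math.log((N + 1) / (df + 1))  # smoothed IDF, one common variant

def tfidf_aggregates(a, b):
    """Min, max, and mean TF-IDF weight over the pair's shared terms."""
    shared = set(a) & set(b)
    weights = [((a[t] + b[t]) / 2) * idf(t) for t in shared]
    if not weights:
        return 0.0, 0.0, 0.0
    return min(weights), max(weights), sum(weights) / len(weights)

lo, hi, avg = tfidf_aggregates(docs[0], docs[2])  # shared: "app", "crashes"
```

Each aggregate becomes one numeric feature of the pair, so the same TF/IDF statistics yield several complementary similarity views.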

Another question of this research is how to determine which features are most efficient for validation performance. The problem of feature efficiency detection is similar to the dimension reduction problem in data mining, which has many heuristic and meta-heuristic approaches. This research uses an overall average of previously normalized heuristic metrics as a new hybrid metric to measure the efficiency of each feature and pre-validate the newly proposed features before experimental evaluation. Measuring the information efficiency of each feature is necessary to avoid introducing additional features that have not yet been considered by previous researchers but are not significantly useful for validation performance. Thus, the main contributions of this study are as follows:
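The "average of normalized heuristic metrics" idea can be sketched as follows; the metric names below are illustrative placeholders, not the paper's exact set. Each heuristic's per-feature scores are min-max normalized to [0, 1] so they are comparable, then averaged per feature:

```python
# Sketch: a hybrid feature-efficiency score as the average of several
# min-max-normalized heuristic metrics. Metric names are placeholders.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # constant metric: uninformative
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_scores(metric_table):
    """metric_table: {metric_name: [score of each feature]} -> per-feature score."""
    normalized = [min_max_normalize(v) for v in metric_table.values()]
    return [sum(col) / len(normalized) for col in zip(*normalized)]

metrics = {
    "info_gain": [0.9, 0.1, 0.5],  # hypothetical per-feature scores
    "chi2":      [8.0, 1.0, 4.0],  # different scale; normalization fixes this
}
scores = hybrid_scores(metrics)  # feature 0 ranks highest
```

Features can then be ranked by this single hybrid score before any classifier is trained, which is the pre-validation role described above.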

  • Proposing a feature extraction model by considering all types of feature extraction techniques

  • Proposing new textual features based on TF-IDF features for validation performance improvement

  • Introducing a new hybrid heuristic metric for validating the efficiency of our proposed and the state-of-the-art features in the DBRD process

  • Improving the performance of the duplicate bug report detector using different types of features with our introduced baseline model to evaluate proposed features

  • Finding the best machine learning algorithm as a duplicate detector for bug reports

  • Conducting statistical tests to ensure the reproducibility of similar results for un-reviewed bug reports

This paper is organized as follows: Section II reviews the state-of-the-art in feature extraction from bug reports and the techniques for detecting useful features. Section III describes the research methodology of DBRD. Section IV presents our proposed features and uses our proposed method of detecting efficient features to assess their quality. Section V evaluates the proposed methods, and Section VI concludes the paper and envisions future work.

Section snippets

Literature review

As mentioned in the introduction, bug reports form a semi-structured software repository. It is challenging to compare unstructured data, such as text, images, and sounds. A common method is to convert unstructured data to structured data using feature extraction, where every feature represents a property of the primary data. Multiple diverse features can describe the primary data more effectively. For example, histograms, representing the frequency of each color in the image, are useful in comparing two

Feature extraction method

Features can be categorized by the data fields of bug reports, where each category describes a new aspect of a bug report and is extracted using a unique technique. The main idea behind every feature extraction method is the comparison of a pair of bug reports based on their fields. There are many analysis methods, such as equality comparison or subtracting data fields. Based on a history of their usage, features may be divided into five categories [9]: (1) Textual, (2) Temporal, (3) Structural, (4)

Methods of detecting efficient features

Efficient feature discovery is a common issue in dimension reduction problems and in some metric-based classifiers such as decision tree (DT) or rule-based machine learning algorithms. There are some evident and common techniques for useless feature discovery, such as removing features with a standard deviation near zero or with a single dominant value (near 100% frequency). Moreover, some features are correlated, meaning they both describe the same behavior, such as the length and volume of
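The two screening rules named in this section can be sketched as follows; thresholds and column names are illustrative assumptions. Near-constant features are dropped first, and then one of each highly correlated pair is discarded:

```python
# Sketch: screen out useless features by (1) near-zero standard
# deviation and (2) high correlation with an already-kept feature.
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def screen_features(columns, std_eps=1e-6, corr_max=0.95):
    kept = {}
    for name, values in columns.items():
        if statistics.pstdev(values) <= std_eps:
            continue  # near-constant: no discriminative information
        if any(abs(pearson(values, v)) >= corr_max for v in kept.values()):
            continue  # redundant: describes the same behavior as a kept one
        kept[name] = values
    return list(kept)

cols = {
    "constant": [1.0, 1.0, 1.0, 1.0],
    "length":   [1.0, 2.0, 3.0, 4.0],
    "volume":   [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with length
    "color":    [0.3, 0.9, 0.1, 0.7],
}
kept = screen_features(cols)  # ["length", "color"]
```

This mirrors the length/volume example: volume is dropped because it carries no information beyond length.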

Research methodology

Fig. 1 depicts the methodology of DBRD as a data flow diagram (DFD), which was reproduced from many related works [9,15,20,30,36,52,67]. The cylinder, ellipse, and rectangular boxes refer to repositories, intermediate data, and processes, respectively. In this DFD, the dashed line refers to the primary input and output of the methodology. This DFD describes the raw bug reports (box 1) that will be pre-processed (box 2) to make cleaned bug reports (box 3) in the first step, as mentioned in

Proposed features and method of detecting efficient features

There are four significant contributions of this study, which were mentioned in the last paragraph of Section I and are further elaborated below:

  • Proposing new textual features based on TF-IDF features for the feature extraction phase (box 6) in the methodology (Fig. 1).

  • Introducing a new hybrid heuristic metric for validating the efficiency of our proposed and the state-of-the-art features in the DBRD process, which can be performed after the feature extraction phase (box 6) in the methodology (

Evaluation and experimental results

The methodology of duplicate detection was depicted in Fig. 1; however, there are some considerations in this process. If there are n bug reports in the database, there are C(n,2) = n × (n - 1)/2 combinations to check, some of which have a duplicate label, while the others are non-duplicate. All duplicate pairs in the database were selected, and between 1.6 and 2.8 times as many pairs were chosen randomly as non-duplicate samples for each dataset, as shown in Table 6. It should be
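The pair-construction step described in this section can be sketched as follows; the ratio and seed are illustrative, and the exact sampling procedure of the paper may differ. All duplicate pairs are kept, and a bounded random sample of non-duplicate pairs is drawn:

```python
# Sketch: enumerate all C(n, 2) pairs, keep every duplicate pair, and
# randomly sample roughly `ratio` times as many non-duplicate pairs.
import itertools
import random

def build_pairs(reports, duplicate_pairs, ratio=2.0, seed=42):
    all_pairs = list(itertools.combinations(reports, 2))
    positives = [p for p in all_pairs if frozenset(p) in duplicate_pairs]
    negatives = [p for p in all_pairs if frozenset(p) not in duplicate_pairs]
    k = min(len(negatives), round(ratio * len(positives)))
    random.Random(seed).shuffle(negatives)  # fixed seed for reproducibility
    return positives, negatives[:k]

reports = ["BR1", "BR2", "BR3", "BR4", "BR5"]
dups = {frozenset(("BR1", "BR2")), frozenset(("BR1", "BR3"))}
pos, neg = build_pairs(reports, dups)  # 2 positives, 4 sampled negatives
```

Sampling negatives rather than using all C(n,2) pairs keeps the training set balanced and tractable, which matches the 1.6x-2.8x ratios reported in Table 6.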

Conclusion

In this study, we proposed a new feature extraction model, some new features based on the aggregation values (min, max, and simple average) of the TF and IDF, and a heuristic approach to detecting the efficiency of our newly proposed features. Our proposed features have no computational overhead and retain the time complexity of the previous feature extraction process. The experimental results showed that our proposed features, compared with the state-of-the-art ones, achieved, on average, 2%

Credit author statement

The first and second authors participated in formal analysis and the validation of the proposed method. The third author edited the paper and verified the proposed method. Moreover, the second author, Babamir, is the supervisor of the project whose output is this paper.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to thank the University of Kashan for supporting this study under Grant #23451.

References (86)

  • A.K. Uysal et al.

    A novel probabilistic feature selection method for text classification

    Knowl Based Syst

    (2012)
  • A. Ganeshpurkar et al.

    Chapter 7 - Concepts of Hypothesis Testing and Types of Errors

    Dosage Form Design Parameters

    (2018)
  • R. Pressman et al.

    Software Engineering

    (2014)
  • Y. Jiang, P. Lu, X. Su, T. Wang, LTRWES: a new framework for security bug report detection, Inf Softw Technol, 124...
  • Y.C. Cavalcanti et al.

    The bug report duplication problem: an exploratory study

    Software Quality Journal

    (2013)
  • B. Soleimani Neysiani et al.

    Methods of Feature Extraction for Detecting the Duplicate Bug Reports in Software Triage Systems

  • C. Sun et al.

    A discriminative model approach for accurate duplicate bug report retrieval

  • M.J. Pazzani et al.

    Content-based recommendation systems

    The Adaptive Web

    (2007)
  • D.W. Embley et al.

    Ontology-based extraction and structuring of information from data-rich unstructured documents

  • A. Hindle et al.

    Preventing duplicate bug reports by continuously querying bug reports

    Empirical Software Engineering

    (2018)
  • A. Hindle

    Stopping duplicate bug reports before they start with Continuous Querying for bug reports

  • S. Banerjee et al.

    Automated duplicate bug report classification using subsequence matching

  • B. Soleimani Neysiani et al.

    Improving Performance of Automatic Duplicate Bug Reports Detection Using Longest Common Sequence

  • B. Soleimani Neysiani et al.

    Duplicate Detection Models for Bug Reports of Software Triage Systems: a Survey

    Current Trends In Computer Sciences & Applications

    (2019)
  • B. Soleimani Neysiani et al.

    Automatic Duplicate Bug Report Detection using Information Retrieval-based versus Machine Learning-based Approaches

  • C. Sun et al.

    Towards more accurate retrieval of duplicate bug reports

  • B. Soleimani Neysiani et al.

    Automatic Typos Detection in Bug Reports

  • B. Soleimani Neysiani et al.

    Automatic Interconnected Lexical Typo Correction in Bug Reports of Software Triage Systems

  • B. Soleimani Neysiani et al.

    Fast Language-Independent Correction of Interconnected Typos to Finding Longest Terms

  • B. Soleimani Neysiani et al.

    New labeled dataset of interconnected lexical typos for automatic correction in the bug reports

    SN Applied Sciences

    (2019)
  • X. Yang et al.

    Combining word embedding with information retrieval to recommend similar bug reports

  • A. Budhiraja et al.

    DWEN: deep word embedding network for duplicate bug report detection in software repositories

  • B. Soleimani Neysiani, S.M. Babamir, New Methodology of Contextual Features Usage in Duplicate Bug Reports Detection, in:...
  • P. Runeson et al.

    Detection of duplicate defect reports using natural language processing

  • A. Lazar et al.

    Improving the accuracy of duplicate bug report detection using textual similarity measures

  • S. Wang et al.

    Improving bug localization using correlations in crash reports

  • X. Wang et al.

    An approach to detecting duplicate bug reports using natural language and execution information

Proceedings of the 30th International Conference on Software Engineering

    (2008)
  • S. Kim et al.

Crash graphs: An aggregated view of multiple crashes to improve crash triage

    Dependable Systems & Networks (DSN), 2011 IEEE/IFIP 41st International Conference on

    (2011)
  • A. Alipour et al.

    A Contextual Approach Towards More Accurate Duplicate Bug Report Detection

Proceedings of the 10th Working Conference on Mining Software Repositories

    (2013)
  • A.T. Nguyen et al.

    Duplicate bug report detection with a combination of information retrieval and topic modeling

Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE)

    (2012)
  • L. Hiew

Assisted detection of duplicate bug reports, Faculty of Graduate Studies (Computer Science)

    The University Of British Columbia

    (2006)
  • N. Jalbert et al.

    Automated duplicate detection for bug tracking systems

IEEE International Conference on Dependable Systems and Networks (DSN) with FTCS and DCC

    (2008)
  • N.K. Nagwani et al.

    Weight similarity measurement model based, object oriented approach for bug databases mining to detect similar and duplicate bugs

Proceedings of the International Conference on Advances in Computing, Communication and Control

    (2009)