Studying the Relationship Between the Usage of APIs Discussed in the Crowd and Post-Release Defects

https://doi.org/10.1016/j.jss.2020.110724

Highlights

  • In this paper, we investigate the relationship between the usage of APIs discussed in the crowd and software quality.

  • We propose a set of metrics based on crowd knowledge to study software quality.

  • We statistically show that source code files using more challenging APIs are more likely to be defect-prone.

  • We also show that our crowd-related metrics can help in explaining post-release defects.

  • Finally, we show that our proposed metrics can improve the predictive power of the traditional defect prediction models.

Abstract

Software development nowadays relies heavily on libraries, frameworks, and the Application Programming Interfaces (APIs) they provide. However, due to challenges such as complexity and a lack of documentation, these APIs may pose various obstacles for developers and introduce common defects into software systems. To resolve these issues, developers usually turn to Question and Answer (Q&A) websites such as Stack Overflow, asking questions and finding solutions to their API-related problems. Therefore, these websites have become indispensable sources of knowledge for developers, a body of information also known as crowd knowledge.

However, the relation of this knowledge to software quality has never been adequately explored before. In this paper, we study whether using APIs that are challenging according to the discussions on Stack Overflow is related to code quality, defined in terms of post-release defects. To this end, we define the concept of the challenge of an API, which denotes how much the API is discussed in high-quality posts on Stack Overflow. Then, using this concept, we propose a set of product and process metrics. We empirically study the statistical correlation between our metrics and post-release defects, as well as the explanatory and predictive power they add to traditional models, through a case study on five open source projects: Spring, Elastic Search, Jenkins, the K-9 Mail Android client, and the OwnCloud Android client.

Our findings reveal that our metrics have a positive correlation with post-release defects that is comparable to known high-performance traditional process metrics, such as code churn and the number of pre-release defects. Furthermore, our proposed metrics provide additional explanatory and predictive power for software quality when added to models based on existing product and process metrics. Our results suggest that software developers should consider allocating more resources to reviewing and improving external API usages to prevent further defects.

Introduction

Today, software development relies heavily on libraries, frameworks, and the APIs they offer. However, these APIs may introduce various common defects into software systems. The causes of these defects include the lack of proper API documentation (Souza et al., 2019), the complexity and poor structure of the API, which lead to misunderstandings (Robillard and Deline, 2011, Campos et al., 2016a), backward compatibility issues (Wang et al., 2015), API correctness (e.g., unexpected behavior of the API) (Wang et al., 2015), and the change-proneness of APIs (Linares-Vásquez et al., 2014). Usually, such defects occur rapidly, independent of the application domain (Campos et al., 2016a).

Upon encountering such errors, defects, and even conceptual questions, developers may ask for help by explaining the issue, sometimes attaching their code, on Q&A websites such as Stack Overflow (Uddin and Khomh, 2019). Usually, they find their questions answered very quickly, with a median answer time of 11 min (Mamykina et al., 2011). Both questions and answers may be validated and rated by other developers through voting and comments. Users who post up-voted questions or answers receive reputation scores, which motivates individuals to contribute (Wang et al., 2015). Thus, Q&A websites such as Stack Overflow have become an indispensable source for developers to find solutions to their questions and issues (Wang et al., 2015).

At the time of writing this paper, more than 17.8 million questions, 27.2 million answers, and 70 million comments have been submitted to Stack Overflow. According to the Stack Overflow developer survey, each month about 50 million people visit the website to learn, share, and build their careers (Anon, 2018). Consequently, the information available on this website constitutes an enormous body of crowd knowledge about common errors, defects, and concepts, trusted by millions of developers (de Souza et al., 2014, Mao et al., 2017).

The knowledge obtainable from Stack Overflow covers a wide range of aspects such as security and performance issues (Mao et al., 2017), programming styles (Barua et al., 2014), and API usage obstacles (Wang and Godfrey, 2013). Souza et al. (2019) state that many of the posts on Stack Overflow are primarily about API usage challenges. Moreover, API-related issues inferred from mining Q&A websites hold particular promise, as they contain discussions of the real-world issues encountered by millions of developers (Wang et al., 2015). For example, the method cos(double angle) offered by the class java.lang.Math (in the Java programming language) has confused a large number of developers. This method returns the trigonometric cosine of an angle, and its only argument, angle, must be given in radians. However, many developers do not comply with this requirement, and this misunderstanding has yielded plenty of highly up-voted questions related to this method.2 Additionally, focusing on more challenging APIs may distract developers from the application itself, so new errors related to the application logic may arise.
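As a minimal illustration of this pitfall (a standalone example, not code from any of the studied projects), passing an angle in degrees to Math.cos silently produces a wrong result, whereas converting to radians first gives the expected value:

    public class CosineExample {
        public static void main(String[] args) {
            // Math.cos interprets its argument as radians, so passing degrees is a bug.
            double wrong = Math.cos(60);                  // cos(60 rad) ≈ -0.952, not 0.5
            double right = Math.cos(Math.toRadians(60));  // cos(60°) = 0.5
            System.out.println(wrong + " vs. " + right);
        }
    }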

Prior studies have shown the relation between API changes and the quantity of questions submitted to Stack Overflow (Linares-Vásquez et al., 2014), and have mined developers’ obstacles (Wang and Godfrey, 2013) and opinions (Uddin and Khomh, 2019) on APIs from this website. Nevertheless, none of the prior research has focused on analyzing the effect of this knowledge on defect prediction models. We conjecture that by extracting the knowledge about APIs from Stack Overflow, we can better explain and predict software defects.

In this paper, we investigate the relationship between the usage of APIs discussed in the crowd and software quality. To this end, we define the concept of the challenge of an API, i.e., how much an API is discussed in high-quality discussions on Stack Overflow. To better study the quality of Stack Overflow discussions, we statistically investigate the quality descriptors mentioned in prior studies (e.g., up votes, view count, favorite count, questioner reputation, etc.), and employ Exploratory Factor Analysis (EFA) (Fabrigar et al., 1999) to identify the underlying relationships between these quality descriptors. Using EFA, we find that three underlying factors can explain the interrelationships among all quality descriptors. Furthermore, using the concept of the challenge of an API, we propose a set of metrics based on crowd knowledge to study software quality. We investigate how our proposed metrics can help in explaining software defects, that is, whether adding our metrics to models built with traditional metrics increases the proportion of variation that the prediction model accounts for. We also investigate whether our metrics can improve the predictive power of the baseline models.
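As a rough sketch only (the notation here is illustrative, not our exact definition, which is given in Section 3), the challenge of an API a can be thought of as an aggregate of the quality of the Stack Overflow posts that discuss it:

    challenge(a) = Σ_{p ∈ P(a)} quality(p)

where P(a) is the set of posts whose main concern is a, and quality(p) is a score derived from the underlying factors identified by EFA.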

To measure code quality, we employ post-release defects, since they are widely used by prior studies in the software quality area (Shang et al., 2015, Shihab et al., 2012). Post-release defects are the defects found up to six months after the release of a given version (de Pádua and Shang, 2018). We perform our detailed case study over 17 million Stack Overflow discussions and five open source projects, including Spring, Elastic Search, Jenkins, the K-9 Mail Android client, and the OwnCloud Android client, with a focus on the following research questions:

RQ1: Are source code files using more challenging APIs more likely to be defect-prone?

We find positive correlations between crowd-related metrics and post-release defects. Our results show that in 4 out of 10 releases, there exists at least one crowd-related metric with a higher correlation with post-release defects than the number of pre-release defects, which has been shown to have the highest correlation with post-release defects among traditional metrics (Moser et al., 2008). Given a version, pre-release defects are the defects found up to six months before the release of that version (Chen et al., 2017).
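For illustration only (hypothetical report dates; the six-month windows follow the definitions above), counting pre- and post-release defects for a given release date amounts to a simple date-window filter:

    import java.time.LocalDate;
    import java.util.List;

    public class DefectWindows {
        // Counts the reports that fall within the half-open interval [from, to).
        static long countBetween(List<LocalDate> reportDates, LocalDate from, LocalDate to) {
            return reportDates.stream()
                    .filter(d -> !d.isBefore(from) && d.isBefore(to))
                    .count();
        }

        public static void main(String[] args) {
            LocalDate release = LocalDate.of(2019, 1, 15);  // hypothetical release date
            List<LocalDate> reports = List.of(
                    LocalDate.of(2018, 10, 3),    // counted as a pre-release defect
                    LocalDate.of(2019, 3, 21));   // counted as a post-release defect
            long pre  = countBetween(reports, release.minusMonths(6), release);
            long post = countBetween(reports, release, release.plusMonths(6));
            System.out.println("pre-release = " + pre + ", post-release = " + post);
        }
    }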

RQ2: Can crowd knowledge help in explaining post-release defects?

We find that our crowd-related metrics provide additional, statistically significant explanatory power for software quality over traditional baseline metrics. More specifically, we achieve an 11%–51% improvement when we add our metrics to models based on traditional metrics. Further, we find that our metrics have a positive effect on prediction models.

RQ3: Can crowd knowledge help in predicting post-release defects?

When our crowd-related metrics are added to the model based on traditional metrics, the predictive power of the model increases by 4%–18% in terms of the F1 measure. More specifically, we find that our metrics provide a larger improvement in projects that use more external APIs.
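(For reference, the F1 measure is the harmonic mean of precision and recall: F1 = 2 · precision · recall / (precision + recall).)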

Our findings could be leveraged by software developers to allocate more reviewing and testing effort to source code files that use more challenging APIs, to prevent further defects. However, this does not imply that developers should avoid using more challenging APIs. Instead, our findings complement prior research on identifying high-risk source code to optimize the process of testing and reviewing.

To the best of our knowledge, this paper is the first attempt to establish an empirical link between the crowd knowledge obtained from Q&A websites and post-release defects.

In summary, the contributions of this paper include:

  • We propose new metrics based on crowd knowledge that can be used to better explain and predict software defects.

  • We perform an empirical study and quantify the statistical relation between our metrics and software quality in terms of post-release defects.

The rest of the paper is organized as follows. Section 2 presents a few motivating examples. Section 3 describes how we model the crowd knowledge. Section 4 covers how our study is set up. Section 5 presents the results of our case study. Section 6 discusses overall points about our study. Section 7 mentions the threats to the validity of our findings. Section 8 discusses prior research related to this work. Finally, Section 9 concludes this paper and provides future research directions.


Motivating examples

This section presents a few examples that motivate investigating the relation between using more challenging APIs discussed in the crowd and source code quality.

In revision 1e7a75c042 of the OwnCloud Android client,3 the developer uses the WeakReference class, which provides a reference that does not protect the referenced object from collection by the Java garbage collector. This revision introduces a bug, i.e., misusing WeakReference yielded some build time
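The following is a minimal, hypothetical sketch of the general pitfall (not the actual OwnCloud code): a WeakReference does not keep its referent alive, so dereferencing the result of get() without a null check can fail once the garbage collector has reclaimed the object.

    import java.lang.ref.WeakReference;

    public class WeakReferenceExample {
        public static void main(String[] args) {
            WeakReference<byte[]> ref = new WeakReference<>(new byte[1024]);

            System.gc();  // after a collection, get() may legitimately return null

            // Risky: ref.get().length may throw a NullPointerException
            // if the referent has already been collected.

            // Safer: always check the result of get() before using it.
            byte[] data = ref.get();
            if (data != null) {
                System.out.println("still reachable, length = " + data.length);
            } else {
                System.out.println("referent has been garbage collected");
            }
        }
    }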

Modeling the crowd knowledge

In this section, we describe how we model the crowd knowledge available on Stack Overflow in order to propose crowd-related metrics.

The high-level process of calculating crowd-related metrics is depicted in Fig. 1. Our approach is based on APIs discussed in the crowd. Thus, as the first step, we parse the heterogeneous data of Stack Overflow (step 1) to extract the code elements and then the APIs from discussions. Next, we identify the APIs that are the main concerns of the discussions (step 2) by

Case study design

In this section, we introduce the systems that we employ as our case study and other data processing steps.

Case study results

In this section, we present and discuss the results of our case study. First, we statistically analyze the quality descriptors listed in Table 1. Then, for each research question introduced in Section 1, we discuss the underlying motivation, our approach toward answering it, and the obtained experimental results. Finally, we conduct a qualitative analysis of the challenge of APIs.

Discussion

In this section, we discuss overall points about our findings. Today, software development relies heavily on external packages and libraries. Although our results show a relation between using more challenging APIs and defects, developers cannot and should not avoid using APIs with high challenge, because leveraging libraries makes the code base smaller, which increases maintainability. Further, developers do not worry about further development and improvement of the external libraries

Threats to validity

In this section we discuss the threats to the validity of our study.

External Validity. Our study is based on five popular open source Java projects publicly available on GitHub. However, the results of our study may not necessarily generalize to all software systems and programming languages. Further, the H-AST we use for parsing discussions, offered by Ponzanelli et al. (2015), is only available for discussions in the Java language. For other languages, we need to implement the island

Related work

In this section, we describe related work with respect to the use of crowd knowledge in software engineering and defect prediction.

Conclusions and future work

Q&A websites such as Stack Overflow have become an indispensable tool for developers to ask questions and find solutions to their issues and errors. However, the effect of the crowd knowledge obtainable from this huge source of information on explaining defects has never been empirically studied before. In this paper, we modeled this crowd knowledge by proposing a set of metrics and statistically investigated the relation between these crowd-related metrics and software quality.

CRediT authorship contribution statement

Hamed Tahmooresi: Conceptualization, Methodology, Software, Validation, Formal analysis, Writing - original draft. Abbas Heydarnoori: Project administration, Supervision, Methodology, Writing - review & editing. Reza Nadri: Software, Validation, Investigation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (86)

  • Anon, Stack Exchange Data Explorer (2019).
  • Bajaj, K., et al., Mining questions asked by web developers.
  • Barua, A., et al., What are developers talking about? An analysis of topics and trends in Stack Overflow. Empir. Softw. Eng. (2014).
  • Bird, C., et al., Putting it all together: Using socio-technical networks to predict failures.
  • Bird, C., et al., Don’t touch my code!: Examining the effects of ownership on software quality.
  • Bishop, C.M., Pattern Recognition and Machine Learning (2006).
  • Borg, M., Svensson, O., Berg, K., Hansson, D., 2019. SZZ unleashed: an open implementation of the SZZ...
  • Campos, E.C., et al., Searching Stack Overflow for API-usage-related bug fixes using snippet-based queries.
  • Campos, E.C., et al., Searching crowd knowledge to recommend solutions for API usage tasks. J. Softw.: Evol. Process (2016).
  • Cattell, R.B., The scree test for the number of factors. Multivariate Behav. Res. (1966).
  • Cerny, B.A., et al., A study of a measure of sampling adequacy for factor-analytic correlation matrices. Multivariate Behav. Res. (1977).
  • Chatterjee, P., et al., Exploratory study of Slack Q&A chats as a mining source for software engineering tools.
  • Chen, F., et al., Crowd debugging.
  • Chidamber, S.R., et al., A metrics suite for object oriented design. IEEE Trans. Softw. Eng. (1994).
  • Cordeiro, J., et al., Context-based recommendation to support problem solving in software development.
  • Dagenais, B., et al., Recovering traceability links between an API and its learning resources.
  • d’Agostino, R.B., An omnibus test of normality for moderate and large size samples. Biometrika (1971).
  • D’Ambros, M., et al., An extensive comparison of bug prediction approaches.
  • DeCoster, J., 1998. Overview of Factor Analysis. Tuscaloosa,...
  • Fabrigar, L.R., et al., Evaluating the use of exploratory factor analysis in psychological research. Psychol. Methods (1999).
  • Fan, Y., et al., The impact of changes mislabeled by SZZ on just-in-time defect prediction. IEEE Trans. Softw. Eng. (2019).
  • Fu, H., et al., Evaluating answer quality across knowledge domains: Using textual and non-textual features in social Q&A.
  • Fukushima, T., et al., An empirical study of just-in-time defect prediction using cross-project models.
  • Gyimothy, T., et al., Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans. Softw. Eng. (2005).
  • Hassan, A.E., Predicting faults using the complexity of code changes.
  • Herraiz, I., et al., Beyond lines of code: Do we need more complexity metrics?
  • Herzig, K., et al., It’s not a bug, it’s a feature: How misclassification impacts bug prediction.
  • Jiarpakdee, J., et al., The impact of correlated metrics on the interpretation of defect models. IEEE Trans. Softw. Eng. (2019).
  • Jin, G., et al., Understanding and detecting real-world performance bugs. ACM SIGPLAN Not. (2012).
  • Kamei, Y., et al., A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng. (2013).
  • Khomh, F., et al., Predicting post-release defects using pre-release field testing results.
  • Linares-Vásquez, M., et al., API change and fault proneness: A threat to the success of Android apps.
  • Linares-Vásquez, M., et al., How do API changes trigger Stack Overflow discussions? A study on the Android SDK.


    Hamed Tahmooresi is a Ph.D. student at the Sharif University of Technology. His research interests include software engineering, software architecture and design, and mining software repositories. Contact him at [email protected]

    Abbas Heydarnoori is an assistant professor at the Sharif University of Technology. Before, he was a post-doctoral fellow at the University of Lugano, Switzerland. Abbas holds a Ph.D. from the University of Waterloo, Canada. His research interests focus on software evolution, mining software repositories, and recommendation systems in software engineering. Contact him at [email protected]

    Reza Nadri is currently a Master’s student and research assistant at the University of Waterloo, Canada. Before, he got his Bachelor’s degree from the Sharif University of Technology. His research interests include mining software repositories, software analytics, recommendation systems in software engineering, and social aspects of software engineering. Contact him at [email protected]

    1 Present Address: School of Computer Science, University of Waterloo, 200 University Ave. W., Waterloo, ON, Canada, N2L 3G1.
