
iLinker: a novel approach for issue knowledge acquisition in GitHub projects

World Wide Web

Abstract

Social coding facilitates knowledge sharing in GitHub projects. Issue reports in particular are an important source of knowledge in software development: they usually contain relevant information and can be shared and linked in developers’ discussions to aid issue resolution. Linking an issue to potentially related issues, i.e. issue knowledge acquisition, gives developers more targeted resources and information when they search for and resolve issues. However, identifying and acquiring related issues is challenging in general, because in practice it is time-consuming and depends mainly on the experience and knowledge of individual developers. Acquiring related issues automatically is therefore a meaningful task that can improve the development efficiency of GitHub projects. In this paper, we formulate the problem of acquiring related issue knowledge as a recommendation problem. To solve it, we propose a novel approach, iLinker, which combines an information retrieval technique (TF-IDF) with deep learning techniques (Word Embedding and Document Embedding). Our evaluation results show that iLinker outperforms the baseline approaches in both coarse-grained and fine-grained recommendation tasks.
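The abstract names TF-IDF as the information retrieval component of iLinker, but this preview does not describe how the scores are computed or how they are combined with the embedding models. As a rough illustration only, and not the authors' implementation, the sketch below ranks candidate issues for a query issue by the cosine similarity of their TF-IDF vectors. The whitespace tokenizer, the smoothed IDF formula, the function names, and the issue texts are all assumptions made for the example; the Word Embedding and Document Embedding parts are not covered.

```python
import math
from collections import Counter

# Hypothetical issue texts, invented for this example.
ISSUES = {
    101: "timeout error when request uses proxy",
    102: "add documentation for date formatting",
    103: "proxy request fails with timeout",
}


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def build_idf(token_lists):
    # Smoothed inverse document frequency computed over the candidate issues.
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector; terms unseen in the candidate issues are dropped.
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_text, issues, top_k=3):
    # Return up to top_k (issue_id, score) pairs, most similar first.
    tokens_by_id = {i: tokenize(text) for i, text in issues.items()}
    idf = build_idf(list(tokens_by_id.values()))
    query_vec = tfidf_vector(tokenize(query_text), idf)
    scored = [
        (i, cosine(query_vec, tfidf_vector(tokens, idf)))
        for i, tokens in tokens_by_id.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for issue_id, score in recommend("request timeout behind proxy", ISSUES):
        print(issue_id, round(score, 3))
```

In this toy data, the two proxy-related issues share terms with the query and rank above the unrelated documentation issue, which scores zero. A purely lexical score like this cannot match issues that describe the same problem in different words, which is the kind of gap the embedding-based components are presumably meant to address.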


Notes

  1. https://github.com/about

  2. https://bugzilla.mozilla.org/

  3. https://en.wikipedia.org/wiki/Word_embedding

  4. https://ai.google/research/pubs/pub44894

  5. In our dataset, the percentage of cases in which developers link duplicate issues is less than 20%. During the analysis, we randomly selected 250 link cases from the data collected for the Request project (population = 1,110, confidence level = 95%, confidence interval \(\simeq\) 5.46) and manually checked the duplicate relationship between the two linked issues, following the strategy used by Ye et al. [31]. The analysis was performed separately by two coders (the first and third authors). The inter-rater agreement between the two coders is almost perfect (Fleiss’s Kappa value [32] is 0.83). All authors reviewed and agreed on the final result.

  6. In this study, we use the Lancaster stemmer as implemented in NLTK, because it works very well in Python programs and is a very aggressive stemming algorithm with the fastest processing speed. It greatly reduces our working set of words, which helps GitHub projects train quickly on issue data and build practical tools.

  7. https://github.com/moment/moment

  8. https://github.com/request/request

  9. https://github.com/jquery/jquery

  10. https://github.com/diaspora/diaspora

  11. https://github.com/docker/compose

  12. https://github.com/flynn/flynn

  13. https://help.github.com/articles/about-pull-requests/

  14. https://developer.github.com/v3/issues/

  15. https://developer.github.com/v3/issues/timeline/

  16. http://radimrehurek.com/gensim

  17. In our study, for each query issue, we calculate its metric values for NextBug and iLinker. We compute p-value and Cliff’s delta based on all query issues. We use Bonferroni correction to counteract the impact of multiple hypothesis tests.

  18. For each group, the Wilcoxon test results and Cliff’s delta confirm that their differences are significant and substantial.

  19. E.g., https://github.com/request/request/issues/1592
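Notes 5 and 17 mention several statistical quantities without showing how they are obtained. The sketch below is one plausible reading, not the authors' scripts: it computes the margin of error for a sample drawn from a finite population (assuming the standard worst-case proportion p = 0.5 and a finite population correction), Cliff's delta as the difference between the share of pairs where one group is larger and the share where it is smaller, and a Bonferroni-adjusted significance threshold. The function names and the sample metric values in the usage section are hypothetical.

```python
import math


def margin_of_error(sample_size, population, z=1.96, p=0.5):
    # Margin of error in percentage points, with finite population correction.
    # z = 1.96 corresponds to a 95% confidence level; p = 0.5 is the
    # worst-case proportion (an assumption, not stated in the notes).
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    correction = math.sqrt((population - sample_size) / (population - 1))
    return 100 * z * standard_error * correction


def cliffs_delta(xs, ys):
    # Cliff's delta: (#pairs with x > y - #pairs with x < y) / (#all pairs).
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def bonferroni_alpha(alpha, num_tests):
    # Per-test significance threshold under Bonferroni correction.
    return alpha / num_tests


if __name__ == "__main__":
    # Sample of 250 link cases out of 1,110, as in note 5.
    print(round(margin_of_error(250, 1110), 2))  # 5.46

    # Hypothetical per-query metric values for two approaches.
    approach_a = [0.6, 0.7, 0.8]
    approach_b = [0.4, 0.5, 0.6]
    print(cliffs_delta(approach_a, approach_b))
    print(bonferroni_alpha(0.05, 5))
```

With a sample of 250 from a population of 1,110, the first function reproduces the 5.46 figure reported in note 5, which suggests that the note's "confidence interval" is the margin of error in percentage points.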

References

  1. Dabbish, L., Stuart, C., Tsay, J., Herbsleb, J.: Social Coding in Github: Transparency and Collaboration in an Open Software Repository. In: CSCW, pp. 1277–1286. ACM (2012)

  2. Zhang, Y., Wang, H., Yin, G., et al.: Social media in GitHub: the role of @-mention in assisting software development. Sci. China Inf. Sci. 60(3), 032102 (2017)

  3. Gharehyazie, M., Ray, B., Filkov, V.: Some from Here, Some from There: Cross-Project Code Reuse in Github. In: MSR, pp. 291–301. IEEE (2017)

  4. Sun, C., Lo, D., Khoo, S.-C., Jiang, J.: Towards More Accurate Retrieval of Duplicate Bug Reports. In: ASE, pp. 253–262. IEEE (2011)

  5. Zhou, J., Zhang, H., Lo, D.: Where Should the Bugs Be Fixed? More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports. In: ICSE, pp. 14–24. IEEE (2012)

  6. Rocha, H., Valente, M. T., Marques-Neto, H., Murphy, G. C.: An Empirical Study on Recommendations of Similar Bugs. In: SANER, pp. 46–56. IEEE (2016)

  7. Le, Q., Mikolov, T.: Distributed Representations of Sentences and Documents. In: ICML, pp. 1188–1196 (2014)

  8. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013)

  9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J.: Distributed Representations of Words and Phrases and Their Compositionality. In: NIPS, pp. 3111–3119 (2013)

  10. Xu, B., Ye, D., Xing, Z., Xia, X., Chen, G., Li, S.: Predicting Semantically Linkable Knowledge in Developer Online Forums via Convolutional Neural Network. In: ASE, pp. 51–62. ACM (2016)

  11. Ye, X., Shen, H., Ma, X., Bunescu, R., Liu, C.: From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering. In: ICSE, pp. 404–415. ACM (2016)

  12. Yang, X., Lo, D., Xia, X., Bao, L., Sun, J.: Combining Word Embedding with Information Retrieval to Recommend Similar Bug Reports. In: ISSRE, pp. 127–137. IEEE (2016)

  13. Fan, Y., Xia, X., Lo, D., Hassan, A. E.: Chaff from the wheat: Characterizing and determining valid bug reports. IEEE Transactions on Software Engineering (2018)

  14. Li, L., Ren, Z., Li, X., Zou, W., Jiang, H.: How are Issue Units Linked? Empirical Study on the Linking Behavior in GitHub. In: APSEC, pp. 386–395. IEEE (2018)

  15. Zampetti, F., Ponzanelli, L., Bavota, G., Mocci, A., Penta, M. D., Lanza, M.: How Developers Document Pull Requests with External References. In: ICPC, pp. 23–33. IEEE (2017)

  16. Zhang, Y., Yu, Y., Wang, H., Vasilescu, B., Filkov, V.: Within-Ecosystem Issue Linking: a Large-Scale Study of Rails. In: Software Mining, pp. 12–19. ACM (2018)

  17. Zhang, Y., Wu, Y., Wang, T., et al.: A novel approach for recommending semantically linkable issues in GitHub projects. Sci. China Inf. Sci. 62(9), 199105 (2019)

  18. Boisselle, V., Adams, B.: The Impact of Cross-Distribution Bug Duplicates, Empirical Study on Debian and Ubuntu. In: SCAM, pp. 131–140. IEEE (2015)

  19. Blei, D. M., Ng, A. Y., Jordan, M. I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)

  20. Dai, A. M., Olah, C., Le, Q. V.: Document embedding with paragraph vectors. arXiv:1507.07998 (2015)

  21. Crowston, K., Scozzi, B.: Bug fixing practices within free/libre open source software development teams (2008)

  22. Jeong, G., Kim, S., Zimmermann, T.: Improving Bug Triage with Bug Tossing Graphs. In: ESEC/FSE, pp. 111–120. ACM (2009)

  23. Xia, X., Lo, D., Ding, Y., Al-Kofahi, J. M., Nguyen, T. N., Wang, X.: Improving automated bug triaging with specialized topic model. IEEE Trans. Softw. Eng. 43(3), 272–297 (2017)

  24. Yan, M., Zhang, X., Yang, D., Xu, L., Kymer, J. D.: A component recommender for bug reports using Discriminative Probability Latent Semantic Analysis. Inf. Softw. Technol. 73, 37–51 (2016)

  25. Anvik, J., Hiew, L., Murphy, G. C.: Who Should Fix This Bug? In: ICSE, pp. 361–370. ACM (2006)

  26. Guo, P. J., Zimmermann, T., Nagappan, N., Murphy, B.: Characterizing and Predicting Which Bugs Get Fixed: an Empirical Study of Microsoft Windows. In: ICSE, pp. 495–504. IEEE (2010)

  27. Bachmann, A., Bird, C., Rahman, F., Devanbu, P., Bernstein, A.: The Missing Links: Bugs and Bug-Fix Commits. In: FSE, pp. 97–106. ACM (2010)

  28. Ye, X., Bunescu, R., Liu, C.: Learning to Rank Relevant Files for Bug Reports Using Domain Knowledge. In: FSE, pp. 689–699. ACM (2014)

  29. Zhang, Y., Yin, G., Wang, T., Yu, Y., Wang, H.: Evaluating Bug Severity Using Crowd-Based Knowledge: an Exploratory Study. In: Internetware, pp. 70–73. ACM (2015)

  30. Wang, X., Zhang, L., Xie, T., Anvik, J., Sun, J.: An Approach to Detecting Duplicate Bug Reports Using Natural Language and Execution Information. In: ICSE, pp. 461–470. IEEE (2008)

  31. Ye, D., Xing, Z., Kapre, N.: The structure and dynamics of knowledge network in domain-specific q&a sites: a case study of stack overflow. Empir. Softw. Eng. 22(1), 375–406 (2017)

  32. Landis, J. R., Koch, G. G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)

  33. Paice, C.: A Word Stemmer Based on the Lancaster Stemming Algorithm. In: ACM SIGIR, pp. 56–61 (1990)

  34. Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: IJCAI, pp. 1137–1145 (1995)

  35. Hindle, A., Alipour, A., Stroulia, E.: A contextual approach towards more accurate duplicate bug report detection and ranking. Empir. Softw. Eng. 21(2), 368–410 (2016)

  36. Thung, F., Kochhar, P. S., Lo, D.: Dupfinder: Integrated Tool Support for Duplicate Bug Report Detection. In: ASE, pp. 871–874. ACM (2014)

  37. Tian, Y., Sun, C., Lo, D.: Improved Duplicate Bug Report Identification. In: CSMR, pp. 385–390. IEEE (2012)

  38. Zhang, Y., Lo, D., Xia, X., Sun, J.-L.: Multi-factor duplicate question detection in stack overflow. J. Comput. Sci. Technol. 30(5), 981–997 (2015)

  39. Zhang, W. E., Sheng, Q. Z., Tang, Z., Ruan, W.: Related Or Duplicate: Distinguishing Similar CQA Questions via Convolutional Neural Networks. In: SIGIR, pp. 1153–1156. ACM (2018)


Acknowledgements

We thank the anonymous reviewers for their insightful comments on earlier versions of this paper. This work was supported by A New Generation of Artificial Intelligence 2030 Program (Grant No. 2018AAA0102304), National Grand R&D Plan (Grant No. 2018YFB1003903), and National Natural Science Foundation of China (Grant No. 61432020).

Author information


Corresponding author

Correspondence to Yang Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, Y., Wu, Y., Wang, T. et al. iLinker: a novel approach for issue knowledge acquisition in GitHub projects. World Wide Web 23, 1589–1619 (2020). https://doi.org/10.1007/s11280-019-00770-1


  • DOI: https://doi.org/10.1007/s11280-019-00770-1
