iLinker: a novel approach for issue knowledge acquisition in GitHub projects

Zhang, Yang; Wu, Yiwen; Wang, Tao; Wang, Huaimin

doi:10.1007/s11280-019-00770-1

Published: 27 January 2020

World Wide Web, Volume 23, pages 1589–1619 (2020)

Yang Zhang (ORCID: orcid.org/0000-0002-3111-1534)¹, Yiwen Wu¹, Tao Wang¹ & Huaimin Wang¹

Abstract

Social coding facilitates knowledge sharing in GitHub projects. Issue reports in particular are an important form of software development knowledge: they usually contain relevant information and can therefore be shared and linked in developers’ discussions to aid issue resolution. Linking an issue to potentially related issues, i.e. issue knowledge acquisition, gives developers more targeted resources and information when they search for and resolve issues. However, identifying and acquiring related issues is generally challenging, because in practice it is time-consuming and depends mainly on the experience and knowledge of individual developers. Acquiring related issues automatically is therefore a meaningful task that can improve the development efficiency of GitHub projects. In this paper, we formulate the acquisition of related issue knowledge as a recommendation problem. To solve it, we propose a novel approach, iLinker, which combines an information retrieval technique, i.e. TF-IDF, with deep learning techniques, i.e. Word Embedding and Document Embedding. Our evaluation results show that iLinker outperforms the baseline approaches in both coarse-grained and fine-grained recommendation tasks.
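
To make the recommendation framing concrete, the following is a minimal sketch of the TF-IDF component only: it ranks candidate issues by cosine similarity to a query issue. It is not the authors' implementation. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `recommend`), the smoothed IDF formula, and the sample issue titles are all assumptions made for illustration, and the Word Embedding and Document Embedding components are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant; the paper's exact formula is not given here).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query_index, docs, top_k=3):
    """Return indices of the top_k documents most similar to the query document."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors) if i != query_index]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]


# Hypothetical issue titles, invented for this example.
issues = [
    "Timeout option ignored when following redirects",
    "Add support for custom date formats",
    "Redirects ignore the timeout setting",
    "Documentation typo in README",
]
print(recommend(0, issues, top_k=2))  # → [2, 1]
```

Issue 2 ranks first because it shares the terms "timeout" and "redirects" with the query. The other two candidates share no terms and tie at zero, which illustrates why purely lexical matching may benefit from being combined with embedding-based similarity.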

Notes

https://github.com/about
https://bugzilla.mozilla.org/
https://en.wikipedia.org/wiki/Word_embedding
https://ai.google/research/pubs/pub44894
In our dataset, duplicate issues account for less than 20% of the issues that developers link. During the analysis, we randomly selected 250 link cases from the data collected for the Request project (population = 1,110; confidence level = 95%; confidence interval ≈ 5.46) and manually checked the duplicate relationship between the two linked issues, following the strategy used by Ye et al. [31]. The analysis was performed separately by two coders (the first and third authors). The inter-rater agreement between the two coders is almost perfect (Fleiss’s Kappa value [32] is 0.83). All authors reviewed and agreed on the final result.
In this study, we use the Lancaster stemmer as implemented in NLTK, because it works very well in Python programs and is a very aggressive stemming algorithm with the fastest processing speed. It greatly reduces our working set of words, which helps GitHub projects train on issue data quickly and build practical tools.
https://github.com/moment/moment
https://github.com/request/request
https://github.com/jquery/jquery
https://github.com/diaspora/diaspora
https://github.com/docker/compose
https://github.com/flynn/flynn
https://help.github.com/articles/about-pull-requests/
https://developer.github.com/v3/issues/
https://developer.github.com/v3/issues/timeline/
http://radimrehurek.com/gensim
In our study, we calculate the metric values of NextBug and iLinker for each query issue. We compute the p-value and Cliff’s delta over all query issues, and we use the Bonferroni correction to counteract the impact of multiple hypothesis tests.
For each group, the Wilcoxon test results and Cliff’s delta confirm that their differences are significant and substantial.
E.g., https://github.com/request/request/issues/1592
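
The statistics mentioned in the notes above can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis script. The margin-of-error function assumes the conventional worst-case proportion p = 0.5, z = 1.96 for a 95% confidence level, and a finite population correction. With a sample of 250 from a population of 1,110 it gives about 5.46, which is consistent with the reported interval but does not confirm the exact formula the authors used. The function names, the sample data, and the number of tests passed to the Bonferroni helper are hypothetical. The Wilcoxon test itself is not shown.

```python
import math


def margin_of_error(sample_size, population, z=1.96, p=0.5):
    """Margin of error in percentage points, with finite population correction.

    z=1.96 (95% confidence) and p=0.5 (worst case) are assumed defaults.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    correction = math.sqrt((population - sample_size) / (population - 1))
    return 100 * z * standard_error * correction


def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / total pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def bonferroni_alpha(alpha, num_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / num_tests


# Sample of 250 link cases drawn from 1,110, as stated in the note.
print(round(margin_of_error(250, 1110), 2))  # → 5.46

# Hypothetical per-query metric values for two approaches.
approach_a = [3, 4, 5]
approach_b = [1, 2, 3]
print(cliffs_delta(approach_a, approach_b))  # 8 of 9 pairs favour approach_a

# Hypothetical: five hypothesis tests at an overall alpha of 0.05.
print(bonferroni_alpha(0.05, 5))
```

Cliff's delta ranges from -1 to 1, and a value near either extreme means that one group almost always dominates the other, so it complements the p-value by describing the size of the difference.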

References

Dabbish, L., Stuart, C., Tsay, J., Herbsleb, J.: Social Coding in Github: Transparency and Collaboration in an Open Software Repository. In: CSCW, pp. 1277–1286. ACM (2012)
Zhang, Y., Wang, H., Yin, G., et al.: Social media in GitHub: the role of @-mention in assisting software development. Sci. China Inf. Sci. 60(3), 032102 (2017)
Gharehyazie, M., Ray, B., Filkov, V.: Some from Here, Some from There: Cross-Project Code Reuse in Github. In: MSR, pp. 291–301. IEEE (2017)
Sun, C., Lo, D., Khoo, S.-C., Jiang, J.: Towards More Accurate Retrieval of Duplicate Bug Reports. In: ASE, pp. 253–262. IEEE (2011)
Zhou, J., Zhang, H., Lo, D.: Where Should the Bugs Be Fixed? More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports. In: ICSE, pp. 14–24. IEEE (2012)
Rocha, H., Valente, M. T., Marques-Neto, H., Murphy, G. C.: An Empirical Study on Recommendations of Similar Bugs. In: SANER, pp. 46–56. IEEE (2016)
Le, Q., Mikolov, T.: Distributed Representations of Sentences and Documents. In: ICML, pp. 1188–1196 (2014)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J.: Distributed Representations of Words and Phrases and Their Compositionality. In: NIPS, pp. 3111–3119 (2013)
Xu, B., Ye, D., Xing, Z., Xia, X., Chen, G., Li, S.: Predicting Semantically Linkable Knowledge in Developer Online Forums via Convolutional Neural Network. In: ASE, pp. 51–62. ACM (2016)
Ye, X., Shen, H., Ma, X., Bunescu, R., Liu, C.: From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering. In: ICSE, pp. 404–415. ACM (2016)
Yang, X., Lo, D., Xia, X., Bao, L., Sun, J.: Combining Word Embedding with Information Retrieval to Recommend Similar Bug Reports. In: ISSRE, pp. 127–137. IEEE (2016)
Fan, Y., Xia, X., Lo, D., Hassan, A.E.: Chaff from the wheat: Characterizing and determining valid bug reports. IEEE Transactions on Software Engineering (2018)
Li, L., Ren, Z., Li, X., Zou, W., Jiang, H.: How are Issue Units Linked? Empirical Study on the Linking Behavior in GitHub. In: APSEC, pp. 386–395. IEEE (2018)
Zampetti, F., Ponzanelli, L., Bavota, G., Mocci, A., Penta, M. D., Lanza, M.: How Developers Document Pull Requests with External References. In: ICPC, pp. 23–33. IEEE (2017)
Zhang, Y., Yu, Y., Wang, H., Vasilescu, B., Filkov, V.: Within-Ecosystem Issue Linking: a Large-Scale Study of Rails. In: Software Mining, pp. 12–19. ACM (2018)
Zhang, Y., Wu, Y., Wang, T., et al.: A novel approach for recommending semantically linkable issues in GitHub projects. Sci. China Inf. Sci. 62(9), 199105 (2019)
Boisselle, V., Adams, B.: The Impact of Cross-Distribution Bug Duplicates, Empirical Study on Debian and Ubuntu. In: SCAM, pp. 131–140. IEEE (2015)
Blei, D. M., Ng, A. Y., Jordan, M. I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
Dai, A. M., Olah, C., Le, Q. V.: Document embedding with paragraph vectors. arXiv:1507.07998 (2015)
Crowston, K., Scozzi, B.: Bug fixing practices within free/libre open source software development teams (2008)
Jeong, G., Kim, S., Zimmermann, T.: Improving Bug Triage with Bug Tossing Graphs. In: ESEC/FSE, pp. 111–120. ACM (2009)
Xia, X., Lo, D., Ding, Y., Al-Kofahi, J. M., Nguyen, T. N., Wang, X.: Improving automated bug triaging with specialized topic model. IEEE Trans. Softw. Eng. 43(3), 272–297 (2017)
Yan, M., Zhang, X., Yang, D., Xu, L., Kymer, J. D.: A component recommender for bug reports using Discriminative Probability Latent Semantic Analysis. Inf. Softw. Technol. 73, 37–51 (2016)
Anvik, J., Hiew, L., Murphy, G. C.: Who Should Fix This Bug? In: ICSE, pp. 361–370. ACM (2006)
Guo, P. J., Zimmermann, T., Nagappan, N., Murphy, B.: Characterizing and Predicting Which Bugs Get Fixed: an Empirical Study of Microsoft Windows. In: ICSE, pp. 495–504. IEEE (2010)
Bachmann, A., Bird, C., Rahman, F., Devanbu, P., Bernstein, A.: The Missing Links: Bugs and Bug-Fix Commits. In: FSE, pp. 97–106. ACM (2010)
Ye, X., Bunescu, R., Liu, C.: Learning to Rank Relevant Files for Bug Reports Using Domain Knowledge. In: FSE, pp. 689–699. ACM (2014)
Zhang, Y., Yin, G., Wang, T., Yu, Y., Wang, H.: Evaluating Bug Severity Using Crowd-Based Knowledge: an Exploratory Study. In: Internetware, pp. 70–73. ACM (2015)
Wang, X., Zhang, L., Xie, T., Anvik, J., Sun, J.: An Approach to Detecting Duplicate Bug Reports Using Natural Language and Execution Information. In: ICSE, pp. 461–470. IEEE (2008)
Ye, D., Xing, Z., Kapre, N.: The structure and dynamics of knowledge network in domain-specific q&a sites: a case study of stack overflow. Empir. Softw. Eng. 22(1), 375–406 (2017)
Landis, J. R., Koch, G. G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)
Paice, C.: A Word Stemmer Based on the Lancaster Stemming Algorithm. In: ACM SIGIR, pp. 56–61 (1990)
Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: IJCAI, pp. 1137–1145 (1995)
Hindle, A., Alipour, A., Stroulia, E.: A contextual approach towards more accurate duplicate bug report detection and ranking. Empir. Softw. Eng. 21(2), 368–410 (2016)
Thung, F., Kochhar, P. S., Lo, D.: Dupfinder: Integrated Tool Support for Duplicate Bug Report Detection. In: ASE, pp. 871–874. ACM (2014)
Tian, Y., Sun, C., Lo, D.: Improved Duplicate Bug Report Identification. In: CSMR, pp. 385–390. IEEE (2012)
Zhang, Y., Lo, D., Xia, X., Sun, J.-L.: Multi-factor duplicate question detection in stack overflow. J. Comput. Sci. Technol. 30(5), 981–997 (2015)
Zhang, W. E., Sheng, Q. Z., Tang, Z., Ruan, W.: Related Or Duplicate: Distinguishing Similar CQA Questions via Convolutional Neural Networks. In: SIGIR, pp. 1153–1156. ACM (2018)

Acknowledgements

We thank the anonymous reviewers for their insightful comments on earlier versions of this paper. This work was supported by the A New Generation of Artificial Intelligence 2030 Program (Grant No. 2018AAA0102304), the National Grand R&D Plan (Grant No. 2018YFB1003903), and the National Natural Science Foundation of China (Grant No. 61432020).

Author information

Authors and Affiliations

National University of Defense Technology, Changsha, 410073, China
Yang Zhang, Yiwen Wu, Tao Wang & Huaimin Wang

Corresponding author

Correspondence to Yang Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, Y., Wu, Y., Wang, T. et al. iLinker: a novel approach for issue knowledge acquisition in GitHub projects. World Wide Web 23, 1589–1619 (2020). https://doi.org/10.1007/s11280-019-00770-1

Received: 17 January 2019
Revised: 06 November 2019
Accepted: 06 December 2019
Published: 27 January 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11280-019-00770-1
