Neural joint attention code search over structure embeddings for software Q&A sites
Introduction
In software development activities, source code examples are critical for understanding concepts, applying fixes, extending software functionality, etc. Code search increases developers' productivity and reduces duplication of effort. The goal of code search is to retrieve code fragments from a large code corpus that most closely match a developer's intent, which is expressed in natural language. Previous studies have revealed that more than 60% of developers search for code in Q&A sites every day (Hoffmann et al., 2007). As online software Q&A sites (e.g., StackOverflow, GitHub, Krugle) contain millions of open source projects with code fragments, many search engines are designed to retrieve code solutions for natural language (NL) queries issued by developers. Unfortunately, these search engines often return unrelated code, or code in other languages, in search results (Sadowski et al., 2015), even when reformulated queries are provided (Carpineto and Romano, 2012).
Recent work from both academia and industry has enabled more advanced code search using different techniques. Consider the examples in Fig. 1: we collect three different NL queries and their code solutions from Post#133031 (StackOverflow, 2009a), Post#19370921 (StackOverflow, 2012b) and Post#11621614 (StackOverflow, 2012a) in StackOverflow. Several challenges remain for previous work, as follows:
(1) Common features mismatch using IR-based techniques. Previous code search methods applied information retrieval (IR) techniques, but most of them depend greatly on the quality of matching terms contained in both NL queries and source code (Haiduc et al., 2013). As the NL query and source code in a pair (shown in Fig. 1) are heterogeneous, they may not share enough common lexical tokens, synonyms, or language structures, especially for short queries. Although most approaches provide effective ways for query reformulation (Paik et al., 2014, Lu et al., 2018, Sirres et al., 2018) (e.g., query expansion, text paraphrase), overly specific queries return no effective results (Grechanik et al., 2010). Moreover, these extractive methods cannot effectively handle irrelevant or noisy keywords in NL queries. In fact, most NL queries and source code may only be semantically related, as in cross-language text retrieval or machine translation. Thus, deep-learning-based methods (Iyer et al., 2016, Huo et al., 2016, Chen and Zhou, 2018) have been proposed to mine the semantic relevance between NL queries and source code. Similarly, we designed NJACS for code search in software Q&A sites using a two-way attention-based neural network, which can perform NL semantic queries over various programming languages.
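To make the mismatch concrete, the following toy sketch (our own illustration, not from the paper) measures the lexical overlap an IR-style matcher would see between an NL query and a semantically relevant code fragment:

```python
import re

def tokens(text):
    # Lowercase alphabetic tokens -- the terms an IR matcher compares.
    return set(t.lower() for t in re.findall(r"[A-Za-z]+", text))

def lexical_overlap(query, code):
    # Jaccard similarity over shared tokens: the signal IR methods rely on.
    q, c = tokens(query), tokens(code)
    return len(q & c) / len(q | c) if q | c else 0.0

query = "how to extract text with a regex"
code = r're.findall(r"\d+", s)'  # semantically relevant, lexically disjoint
print(lexical_overlap(query, code))  # 0.0 -- no shared terms at all
```

Despite answering the query, the fragment shares no terms with it, so purely lexical matching scores it zero; this is exactly the gap that semantic (embedding-based) matching is meant to close.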
(2) Lack of integral code representation via structure embeddings. More recently, various neural models have been applied to learning unified representations of both source code and NL queries, in tasks such as code annotation (Huo et al., 2016), bug localization (Chen and Zhou, 2018) and software repair (White et al., 2016). For code search, Gu et al. (2018) propose a bi-modal neural network named CODEnn, but limit its application to Java code fragments. CODEnn relies on extracting subelements (including method names, tokens, and API sequences Gu et al., 2016), and thus cannot generate an overall semantic representation of the code structure. Unfortunately, since pairs like the one shown in Fig. 1 belong to the SQL domain, which has no method types, these code-split-based embedding methods (Paik et al., 2014, Grechanik et al., 2010) may not be suitable for code fragments with other structures and program types. Compared to CODEnn, Sachdev et al. (2018), Yao et al. (2019) and Cambronero et al. (2019) present different supervised neural models, named NCS, CoaCor and UNIF respectively, that successfully learn code embeddings and integral code representations using corpora of query-code mappings. Nevertheless, these methods use one-hot representations (Turian et al., 2010), pre-trained word2vec embeddings (Mikolov et al., 2013a) or randomly initialized word embeddings (Gu et al., 2018, Yao et al., 2019) for pre-training their neural networks, and thus lack external pre-training information to improve model performance, in the way that BERT (Devlin et al., 2018) does in NL processing (NLP). Unlike them, NJACS introduces pre-trained structure embeddings to capture more structure information and enhance the integral code semantic representation.
(3) Lack of modeling attention focus and structure information. Most existing neural approaches (Huo et al., 2016, Chen and Zhou, 2018, Gu et al., 2018, Yao et al., 2019, Cambronero et al., 2019, Hu et al., 2018) utilize bi-modal networks for code search based on Recurrent Neural Networks (RNNs) (Hochreiter and Schmidhuber, 1997), Convolutional Neural Networks (CNNs) (Pattanayak, 2017) or Deep Q-learning Networks (DQNs) (Mnih et al., 2013). They generally ignore the joint attention information between NL query and source code, and thus cannot effectively capture deeper semantic matching signals. For the pair shown in Fig. 1, understanding the focus of the NL query (e.g., the important terms "extract", "regex" and "text") is helpful for retrieving more relevant code for the "text extraction" problem. However, such methods do not explicitly model query focus (attention links are shown in Fig. 1). Recent studies (Iyer et al., 2016, Devlin et al., 2018, Wan et al., 2018) have also shown that attention mechanisms can be successfully applied to code summarization (Iyer et al., 2016, Wan et al., 2018) and various NLP tasks such as Machine Reading Comprehension (MRC) (Devlin et al., 2018). Besides, many other approaches (Huo et al., 2016, Gu et al., 2018, Sachdev et al., 2018, Cambronero et al., 2019) fail to consider the latent structure information of source code, which carries additional semantics about program functionality beyond the lexical terms. In NJACS, the two-way attention mechanism captures sufficient matching terms to align with the NL query and focus on the more structurally informative parts of the source code.
To address the aforementioned issues, we propose a novel neural network named NJACS for code search in software Q&A sites, which leverages two attention mechanisms: global attention and attentive pooling. To solve the problem of common features mismatch, NJACS learns an enhanced joint representation of NL query and source code, which captures lexical semantics along with code structure and introduces focus information between the NL query and the code fragment. With NJACS, desired code can be retrieved for most common queries without frequent reformulation.
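Once queries and code fragments are embedded into a shared vector space, retrieval reduces to ranking code vectors by similarity to the query vector. A minimal sketch with made-up 3-d vectors (the actual NJACS embeddings and their dimensions are learned by the model):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_codes(query_vec, code_vecs):
    # Return candidate indices sorted by similarity to the query, best first.
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(code_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]

q = [0.9, 0.1, 0.3]                                   # embedded NL query
codes = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.4], [0.0, 0.0, 1.0]]  # embedded fragments
print(rank_codes(q, codes))  # candidate 1 points in nearly the same direction
```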
To the best of our knowledge, we are the first to propose joint-attention-based code search. The main contributions of our work are as follows.
- •
We propose NJACS, a neural joint attention network that utilizes two-way attention mechanism to improve unified representation learning from both natural language and programming languages for retrieving relevant code solutions in software Q&A sites.
- •
We design particular code structure embeddings for pre-training operations with respect to different programming languages and code structures, which is able to capture semantics of the program from both lexical and program structural perspectives.
- •
We construct large-scale software repositories containing almost 10 million pairs, plus 200 sampled examples with the most candidate codes, all extracted from use-case scenarios such as StackOverflow, and conduct comprehensive experiments on these repositories.
The rest of this paper is organized as follows. The next section gives the definitions and motivation. Sections 3 and 4 describe the detailed design of the structure embeddings and the NJACS model, respectively. Experiments are presented in Section 5. Sections 6 and 7 give the evaluation results and discussion. Section 8 presents related work. Finally, we conclude the paper in Section 9.
Problem formulation
In our work, we focus on the desired code (also called the code solution), which solves the questioner's problem. Code search for software Q&A sites can be described as follows: Given a set of common retrieved lists, where each NL query comes together with
Learning structure embeddings
In NJACS, we adopt an advanced distributed representation technique from NLP that learns structure embeddings: continuous distributed vectors that group similar code libraries or tokens in AST nodes. Based on pre-trained structure embeddings, NJACS can introduce more structure information into neural network training to improve code search performance. This section describes the detailed steps for learning structure embeddings.
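As one illustration of the kind of input such pre-training consumes, Python's standard `ast` module can turn a code fragment into a sequence of AST node-type tokens, which a word2vec-style model could then embed (this sketch is our own; the paper's exact extraction pipeline may differ):

```python
import ast

def structure_tokens(source):
    # Parse the fragment and emit AST node-type names in traversal order.
    # Sequences like these can serve as the "sentences" fed to a
    # word2vec-style model when pre-training structure embeddings.
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

code = "def add(a, b):\n    return a + b"
print(structure_tokens(code))
# e.g. ['Module', 'FunctionDef', 'arguments', 'Return', ...]
```

Two fragments with different identifiers but the same shape (a function returning a binary operation) yield near-identical token sequences, which is what lets structure embeddings group structurally similar code.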
NJACS: Neural joint attention code search using structure embeddings
As described in Section 3, we learn the structure embeddings. In this section, we discuss the background of bi-modal embedding of heterogeneous data and introduce the neural techniques used over structure embeddings in NJACS.
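Bi-modal code search models of this family are commonly trained with a margin-based ranking loss that pushes a matching query-code pair above a mismatched pair in the joint embedding space. A minimal sketch (the margin value here is hypothetical, not the paper's setting):

```python
def margin_ranking_loss(sim_pos, sim_neg, margin=0.05):
    # Hinge loss: require the similarity of the true (query, code) pair
    # to exceed that of a sampled negative pair by at least `margin`.
    return max(0.0, margin - sim_pos + sim_neg)

# Well-separated pair: no gradient signal needed.
print(margin_ranking_loss(0.9, 0.2))   # 0.0
# Negative pair scores almost as high as the positive: loss drives training.
print(margin_ranking_loss(0.5, 0.48))
```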
Q&A datasets collection
As described in Section 4.3, NJACS requires a large-scale corpus containing NL queries and their code fragments. To the best of our knowledge, existing code search datasets are collected without a general standard; they are generally based on human annotation or crawled from Q&A forums and online sites. Unfortunately, most Q&A forums contain personal opinions of users that are often not adequately confirmed, or are outdated. According to the statistics, around 74% of Python and 88% of SQL
Evaluation and results
In this section, we evaluate NJACS through experiments. Specifically, our experiments aim to address the following research questions:
- •
RQ1: Does our proposed two-way attention-based neural code search approach for software Q&A sites, NJACS, achieve state-of-the-art performance on all the benchmarks?
- •
RQ2: Can NJACS effectively improve the performance of code search for various programming languages using structure embeddings instead of other embeddings?
- •
RQ3: Whether joint
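Answering such research questions typically relies on standard retrieval metrics such as MRR and SuccessRate@k; a minimal stdlib sketch (the helper names are ours, not the paper's code):

```python
def mrr(ranks):
    # Mean Reciprocal Rank: `ranks` holds the 1-based position of the
    # first correct code for each query (None = no correct hit returned).
    return sum(1.0 / r for r in ranks if r) / len(ranks)

def success_rate_at_k(ranks, k):
    # Fraction of queries whose correct code appears within the top k.
    return sum(1 for r in ranks if r and r <= k) / len(ranks)

ranks = [1, 3, None, 2]            # four queries, one complete miss
print(mrr(ranks))                  # (1 + 1/3 + 1/2) / 4
print(success_rate_at_k(ranks, 2))
```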
Why does NJACS work?
We have identified three advantages of NJACS that may explain its effectiveness in code search.
Constructing a bi-modal embedding model using joint attention information. Unlike traditional representation-based models such as NCS and DeepCS, NJACS adopts a joint-attention-based neural network instead of RNN-based bi-modal embedding techniques, combining sufficiently detailed matching signals and incorporating learning of code- and query-term importance via a two-way attention mechanism. This
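The two-way (attentive pooling) idea can be sketched in a few lines: score every query-token/code-token pair, then max-pool each way to obtain attention weights over both sequences. This is a toy illustration with hand-made 2-d token vectors, not the paper's implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attentive_pool(Q, C):
    # Soft alignment matrix: A[i][j] = dot(query token i, code token j).
    A = [[sum(q * c for q, c in zip(qi, cj)) for cj in C] for qi in Q]
    # Row-wise max-pool -> attention over query tokens;
    # column-wise max-pool -> attention over code tokens.
    query_attn = softmax([max(row) for row in A])
    code_attn = softmax([max(row[j] for row in A) for j in range(len(C))])
    return query_attn, code_attn

Q = [[1.0, 0.0], [0.0, 0.2]]   # e.g., "extract", "the"
C = [[0.9, 0.1], [0.0, 0.1]]   # e.g., "findall", "s"
qa, ca = attentive_pool(Q, C)
print(qa, ca)  # both distributions peak on the first, aligned token
```

Because attention flows both ways, the informative query term and the informative code token reinforce each other, which is the intuition behind joint attention here.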
Related work
Code search tasks have become increasingly popular recently. We survey some of the work that is most closely related to what we are focused on.
Feature extraction-based code search: There are many studies on code search using IR techniques. For instance, McMillan et al. (2011) proposed Portfolio, which extracts relevant functions for a given query by utilizing keyword matching and PageRank. Rahman and Roy (2018) proposed RACK, which converts queries into lists of API classes for collecting
Conclusion & future work
In this paper, we propose NJACS (Neural Joint Attention Code Search), a novel, scalable and effective code search model for software Q&A sites. In addition, NJACS is also suitable for retrieving more general code fragments, such as code elements (e.g., method bodies in C#) and code libraries (e.g., function bodies in Python). NJACS introduces both special code structure embeddings and a joint-attention-based framework to better model the unified vector representations between code fragments and NL
CRediT authorship contribution statement
Gang Hu: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Min Peng: Validation, Formal analysis, Visualization, Software, Supervision, Funding acquisition. Yihan Zhang: Validation, Formal analysis, Visualization. Qianqian Xie: Resources, Writing - review & editing, Supervision, Data curation. Mengting Yuan: Formal analysis, Writing - review & editing, Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We would like to thank the anonymous reviewers for their insightful comments on earlier drafts of this paper. The authors gratefully acknowledge the financial support of the National Key R&D Program of China (Grants 2018YFC1604000 and 2018YFC1604003). This work is also supported in part by the Natural Science Foundation of China (Grants 71950002 (Special Program), 61872272 and 61772382 (General Program)).
Gang Hu received the MS degree in signal and information processing from Yunnan Minzu University, Kunming, China, in 2016. He is currently working toward the Ph.D. degree in Wuhan University. His current research focus areas include information retrieval and search-based software engineering.
References (102)
- High quality information extraction and query-oriented summarization for automatic query-reply in social network. Expert Syst. Appl. (2016)
- A method for speeding up feature extraction based on KPCA. Neurocomputing (2007)
- Learning to represent programs with graphs (2017)
- A convolutional attention network for extreme summarization of source code
- Bimodal modelling of source code and natural language
- Code2vec: learning distributed representations of code. Proc. ACM Program. Lang. (POPL) (2019)
- Embedding-based query expansion for weighted sequential dependence retrieval model
- Latent Dirichlet allocation. J. Mach. Learn. Res. (2003)
- Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (2017)
- Hierarchical learning of cross-language mappings through distributed vector representations for code
- When deep learning met code search
- A survey of automatic query expansion in information retrieval. ACM Comput. Surv.
- A neural framework for retrieval and summarization of source code
- Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. Technol.
- Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc.
- BERT: pre-training of deep bidirectional transformers for language understanding
- Applying deep learning to answer selection: a study and an open task
- DeViSE: a deep visual-semantic embedding model
- A search engine for finding highly relevant applications
- Deep code search
- Deep API learning
- Automatic query reformulations for text retrieval in software engineering
- Improving source code search with natural language phrasal representations of method signatures
- Long short-term memory. Neural Comput.
- Assieme: finding and leveraging implicit references in a web search interface for programmers
- Deep code comment generation
- Deep code comment generation with hybrid lexical and syntactical information. Empir. Softw. Eng.
- Unsupervised software repositories mining and its application to code search
- Learning unified features from natural and programming languages for locating buggy source code
- CodeSearchNet challenge: evaluating the state of semantic code search
- Summarizing source code using a neural attention model
- Mapping language to code in programmatic context
- Repairing programs with semantic code search
- FaCoY: a code-to-code search engine
- Adapting neural text classification for improved software categorization
- Thesaurus-based automatic query expansion for interface-driven code search
- Relationship-aware code search for JavaScript frameworks
- NLTK: the natural language toolkit
- Query expansion via WordNet for effective code search
- Interactive query reformulation for source-code search with word relations. IEEE Access
- Aroma: code recommendation via structural code search. Proc. ACM Program. Lang. (PACMPL)
- Effective approaches to attention-based neural machine translation
- CodeHow: effective code search based on API understanding and extended Boolean model
- Visualizing data using t-SNE. J. Mach. Learn. Res.
- Portfolio: finding relevant functions and their usage
- Distributed representations of words and phrases and their compositionality
Cited by (11)
- Transformer-based code search for software Q&A sites. Journal of Software: Evolution and Process (2024)
- Big Code Search: A Bibliography. ACM Computing Surveys (2023)
- Multilayer self-attention residual network for code search. Concurrency and Computation: Practice and Experience (2023)
Min Peng received the MS and Ph.D. degrees from Wuhan University, China, in 2002 and 2006. She is currently a professor at the School of Computer Science, Wuhan University. Her current NLP research focuses on information retrieval and knowledge graphs. She is a member of the CCF.
Yihan Zhang is currently working toward the BS degree in National University of Singapore. His research interests include information retrieval and code recommendation.
Qianqian Xie received the BS degree from Jiangxi Normal University, China, in 2016. She is currently working toward the Ph.D. degree in the School of Computer Science at Wuhan University. Her research interests include sparse coding and deep learning.
Mengting Yuan received the MS and Ph.D. degree from Wuhan University, Wuhan, China, in 2006. He is currently an associate professor at School of Computer Science, Wuhan University. His current research interest includes mining software repositories and deep learning.