Neural joint attention code search over structure embeddings for software Q&A sites

https://doi.org/10.1016/j.jss.2020.110773

Highlights

  • Propose a joint attention framework for code search in software Q&A sites.

  • Introduce structure embeddings to enhance searching for code fragments.

  • Achieve encouraging performance on four different programming languages.

  • Construct a high-quality evaluation corpus for code search in software Q&A sites.

Abstract

Code search is frequently needed in software Q&A sites during software development. Over the years, various code search engines and techniques have been explored to support user queries. Early approaches often apply text retrieval models to match code fragments to natural-language queries, but fail to build sufficient semantic correlations. Some recent neural methods restructure bi-modal networks to measure semantic similarity. However, they ignore the potential structure information of source code and the joint attention information from natural-language queries. In addition, they mostly target specific code structures rather than the general code fragments found in software Q&A sites.

In this paper, we propose NJACS, a novel two-way attention-based neural network for retrieving code fragments in software Q&A sites, which aligns and attends to the more structurally informative parts of source code for a given natural-language query. Instead of directly learning bi-modal unified vector representations, NJACS first embeds queries and codes separately using bidirectional LSTMs over pre-trained structure embeddings, then learns an aligned joint attention matrix for query-code mappings, and finally derives pooling-based projection vectors in each direction to guide the attention-based representations. On benchmark search codebases collected from StackOverflow, NJACS outperforms state-of-the-art baselines with 7.5% and 6% higher Recall@1 and MRR, respectively. Moreover, our structure embeddings can be leveraged by other deep-learning-based software engineering tasks.

Introduction

In software development activities, source code examples are critical for understanding concepts, applying fixes, extending software functionality, etc. Code search increases developers' productivity and reduces duplication of effort. The goal of code search is to retrieve, from a large code corpus, the code fragments that most closely match a developer's intent expressed in natural language. Previous studies have revealed that more than 60% of developers search for code in Q&A sites every day (Hoffmann et al., 2007). As online software Q&A sites (e.g., StackOverflow, GitHub, Krugle) contain millions of open source projects with code fragments, many search engines have been designed to retrieve code solutions for natural language (NL) queries issued by developers. Unfortunately, these search engines often return unrelated code, or code in other languages, in their search results (Sadowski et al., 2015), even when reformulated queries are provided (Carpineto and Romano, 2012).

Recent work from both academia and industry has enabled more advanced code search using different techniques. Consider the examples in Fig. 1: we collected three different NL queries and their code solutions from Post#133031 (StackOverflow, 2009a), Post#19370921 (StackOverflow, 2012b) and Post#11621614 (StackOverflow, 2012a) on StackOverflow. Several challenges remain for previous work, as follows:

(1) Common-feature mismatch in IR-based techniques. Previous code search methods applied information retrieval (IR) techniques, but most of them depend greatly on the quality of matching terms shared by NL queries and source code (Haiduc et al., 2013). As the NL query and source code in the Q1,C1 pair (shown in Fig. 1) are heterogeneous, they may not share enough common lexical tokens, synonyms, or language structures, especially for short queries. Although most approaches provide effective means of query reformulation (Paik et al., 2014, Lu et al., 2018, Sirres et al., 2018) (e.g., query expansion, text paraphrase), overly specific queries return no effective results (Grechanik et al., 2010). Moreover, these extractive methods cannot effectively handle irrelevant or noisy keywords in NL queries. In fact, most NL queries and source codes may only be semantically related, as in cross-language text retrieval or machine translation. Thus, deep-learning-based methods (Iyer et al., 2016, Huo et al., 2016, Chen and Zhou, 2018) have been proposed to mine the semantic relevance between NL queries and source code. In the same spirit, we designed NJACS for code search in software Q&A sites using a two-way attention-based neural network, which supports NL semantic queries over various programming languages.
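To make the mismatch concrete, the toy scorer below (our illustration, not part of the paper's pipeline) measures pure lexical overlap between a hypothetical NL query and code fragment; heterogeneous pairs like Q1,C1 score near zero even when they are semantically related, which is exactly the gap semantic models aim to close.

```python
# Toy illustration (ours, not the paper's method): score a query/code
# pair by pure lexical overlap, the way term-matching IR methods do.
# The example strings are hypothetical.
import re

def tokens(text):
    """Lower-cased alphabetic tokens, with a crude camelCase split."""
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)
    return set(re.findall(r'[a-zA-Z]+', text.lower()))

def jaccard(query, code):
    q, c = tokens(query), tokens(code)
    return len(q & c) / len(q | c) if q | c else 0.0

query = "how do I read a file line by line"
code = "with open(path) as f:\n    for ln in f:\n        process(ln)"
print(jaccard(query, code))  # 0.0: no shared terms, yet semantically related
```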

(2) Lack of integral code representation via structure embeddings. More recently, various neural models have been applied to learn unified representations of source code and NL queries, in tasks such as code annotation (Huo et al., 2016), bug localization (Chen and Zhou, 2018) and software repair (White et al., 2016). For code search specifically, Gu et al. (2018) propose a bi-modal neural network named CODEnn, but limit its application to Java code fragments. CODEnn relies on extracting subelements (method names, tokens, and API sequences (Gu et al., 2016)), and thus cannot generate an overall semantic representation of the code structure. Unfortunately, for pairs like Q2,C2 (shown in Fig. 1), which belong to the SQL domain and have no method types, such code-split-based embedding methods (Paik et al., 2014, Grechanik et al., 2010) may not suit code fragments of other structures and program types. In contrast to CODEnn, Sachdev et al. (2018), Yao et al. (2019) and Cambronero et al. (2019) present supervised neural models, named NCS, CoaCor and UNIF respectively, that learn code embeddings and integral code representations from corpora of query-code mappings. Nevertheless, these methods pre-train their networks with one-hot representations (Turian et al., 2010), pre-trained word2vec embeddings (Mikolov et al., 2013a) or randomly initialized word embeddings (Gu et al., 2018, Yao et al., 2019), and thus lack the external pre-training information that models like BERT (Devlin et al., 2018) bring to NL processing (NLP). Unlike them, NJACS introduces pre-trained structure embeddings to capture more structure information and enhance the integral semantic representation of code.
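As a rough sketch of the pre-training distinction discussed above (PyTorch assumed; the shapes and stand-in weight matrix are illustrative, not the paper's actual setup), an embedding layer can either start from random weights or be seeded with externally pre-trained structure vectors:

```python
# Sketch: seeding an embedding layer with pre-trained structure vectors
# versus random initialization. Sizes and the stand-in matrix are ours.
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 100

# Random initialization, as in e.g. Gu et al. (2018) / Yao et al. (2019):
random_emb = nn.Embedding(vocab_size, dim)

# Pre-trained initialization: load a (vocab_size x dim) matrix learned
# offline over code-structure corpora and allow fine-tuning.
pretrained = torch.randn(vocab_size, dim)  # stand-in for learned vectors
structure_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
```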

(3) Lack of modeling attention focus and structure information. Most existing neural approaches (Huo et al., 2016, Chen and Zhou, 2018, Gu et al., 2018, Yao et al., 2019, Cambronero et al., 2019, Hu et al., 2018) build bi-modal networks for code search from RNNs (Recurrent Neural Networks) (Hochreiter and Schmidhuber, 1997), CNNs (Convolutional Neural Networks) (Pattanayak, 2017) or DQNs (Deep Q-learning Networks) (Mnih et al., 2013). They generally ignore the joint attention information between NL query and source code, and thus cannot effectively capture deeper semantic matching signals. For the Q3,C3 pair (shown in Fig. 1), understanding the focus of the NL query (e.g., the important terms "extract", "regex" and "text") helps retrieve more relevant codes addressing the "text extraction" problem. However, such methods do not explicitly model query focus (attention links are shown in Fig. 1). Recent studies (Iyer et al., 2016, Devlin et al., 2018, Wan et al., 2018) have also shown that attention mechanisms can be successfully applied to code summarization (Iyer et al., 2016, Wan et al., 2018) and to NLP tasks such as Machine Reading Comprehension (MRC) (Devlin et al., 2018). Besides, many other approaches (Huo et al., 2016, Gu et al., 2018, Sachdev et al., 2018, Cambronero et al., 2019) fail to consider the potential structure information of source code, which carries semantics about program functionality beyond the lexical terms. NJACS's two-way attention mechanism captures sufficient matching terms to align and focus the more structurally informative parts of source code to the NL query; a generic sketch of such a mechanism follows.
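The sketch below illustrates a generic two-way attention with attentive pooling over query and code hidden states. It follows the general scheme described above; the paper's exact parameterization may differ, and the trainable matrix U and all dimensions here are illustrative.

```python
# Minimal two-way (joint) attention sketch in the spirit of NJACS.
# Q and C stand for BiLSTM hidden states of query and code tokens.
import torch
import torch.nn.functional as F

def joint_attention(Q, C, U):
    """Q: (n, d) query states, C: (m, d) code states, U: (d, d) params."""
    G = torch.tanh(Q @ U @ C.T)                   # (n, m) joint attention matrix
    a_q = F.softmax(G.max(dim=1).values, dim=0)   # focus over query tokens
    a_c = F.softmax(G.max(dim=0).values, dim=0)   # focus over code tokens
    r_q = Q.T @ a_q                               # attention-weighted query vector
    r_c = C.T @ a_c                               # attention-weighted code vector
    return r_q, r_c

n, m, d = 8, 30, 128
Q, C, U = torch.randn(n, d), torch.randn(m, d), torch.randn(d, d)
r_q, r_c = joint_attention(Q, C, U)
print(F.cosine_similarity(r_q, r_c, dim=0))       # match score for ranking
```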

To address the aforementioned issues, we propose a novel neural network named NJACS for code search in software Q&A sites, which leverages two attention mechanisms: global attention and attentive pooling. To solve the common-feature mismatch problem, NJACS learns an enhanced joint representation of NL query and source code, which captures lexical semantics along with code structure and introduces focus information between the NL query and the code fragment. With NJACS, the desired codes can be retrieved for most common queries without frequent reformulation.
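Once both modalities are embedded into a shared vector space, retrieval itself reduces to nearest-neighbor ranking. A minimal sketch (PyTorch assumed; the pre-encoded codebase matrix and sizes are placeholders, not the paper's data):

```python
# Sketch: rank pre-encoded code fragments by cosine similarity to a query.
import torch
import torch.nn.functional as F

def search(query_vec, code_matrix, top_k=5):
    """query_vec: (d,); code_matrix: (N, d) of pre-encoded code fragments."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), code_matrix, dim=1)
    scores, idx = sims.topk(top_k)
    return list(zip(idx.tolist(), scores.tolist()))

codebase = torch.randn(10_000, 128)   # stand-in for encoded fragments
print(search(torch.randn(128), codebase))
```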

To the best of our knowledge, we are the first to propose joint-attention-based code search. The main contributions of our work are as follows.

  • We propose NJACS, a neural joint attention network that utilizes a two-way attention mechanism to improve unified representation learning from both natural language and programming languages for retrieving relevant code solutions in software Q&A sites.

  • We design dedicated code structure embeddings for pre-training with respect to different programming languages and code structures, which are able to capture the semantics of programs from both lexical and structural perspectives.

  • We construct large-scale software repositories containing almost 10 million query-code pairs and 200 sampled examples with candidate codes, all extracted from real use-case scenarios on StackOverflow, and conduct comprehensive experiments on these repositories.

The rest of this paper is organized as follows. The next section gives the definitions and motivation. Sections 3 and 4 describe the detailed design of the structure embeddings and the NJACS model, respectively. Experiments are described in Section 5. Sections 6 and 7 give the evaluation results and discussion. Section 8 presents the related work. Finally, we conclude the paper in Section 9.


Problem formulation

In our work, we focus on the desired code (also called the code solution) that solves the questioner's problem. Code search for software Q&A sites can be described as follows: given a set of common retrieved lists, where each NL query qi ∈ Q comes together with

Learning structure embeddings

In NJACS, we adopt an advanced distributed representation technique from NLP to learn structure embeddings: continuous distributed vectors that group similar code libraries or tokens in AST nodes. Based on pre-trained structure embeddings, NJACS can introduce more structure information into neural network training to improve code search performance. This section describes the detailed steps for learning structure embeddings.
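As a minimal sketch of the idea (not the paper's exact pipeline, which also covers languages such as SQL), one could serialize AST node types and identifiers from Python fragments and train skip-gram vectors over the resulting sequences, assuming the gensim library:

```python
# Sketch: learn structure embeddings by training word2vec (skip-gram)
# over AST node-type/identifier sequences. Corpus here is a toy stand-in.
import ast
from gensim.models import Word2Vec

def ast_token_sequence(code):
    """Walk the AST and emit node-type names plus identifier tokens."""
    seq = []
    for node in ast.walk(ast.parse(code)):
        seq.append(type(node).__name__)
        if isinstance(node, ast.Name):
            seq.append(node.id)
    return seq

corpus = [ast_token_sequence(c) for c in [
    "def add(a, b):\n    return a + b",
    "for x in xs:\n    total += x",
]]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("Name"))  # neighbors of the Name node type
```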

NJACS: Neural joint attention code search using structure embeddings

As described in Section 3, we learn the structure embeddings. In this section, we discuss the background of bi-modal embedding of heterogeneous data and introduce the neural techniques applied over structure embeddings in NJACS.
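For reference, a bidirectional LSTM encoder over structure embeddings might look like the following sketch (PyTorch assumed; layer sizes are illustrative); its per-token hidden states are what a joint attention layer would consume:

```python
# Sketch: BiLSTM encoder over (optionally pre-trained) structure embeddings.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, emb: nn.Embedding, hidden: int = 64):
        super().__init__()
        self.emb = emb  # e.g., nn.Embedding.from_pretrained(...)
        self.rnn = nn.LSTM(emb.embedding_dim, hidden,
                           batch_first=True, bidirectional=True)

    def forward(self, ids):               # ids: (batch, seq_len)
        out, _ = self.rnn(self.emb(ids))
        return out                        # (batch, seq_len, 2 * hidden)

enc = BiLSTMEncoder(nn.Embedding(50_000, 100))
states = enc(torch.randint(0, 50_000, (2, 20)))
print(states.shape)  # torch.Size([2, 20, 128])
```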

Q&A datasets collection

As described in Section 4.3, NJACS requires a large-scale corpus that contains NL queries and their code fragments. To the best of our knowledge, existing code search datasets are collected without a general standard; they are typically based on human annotation or crawled from Q&A forums and online sites. Unfortunately, most Q&A forums carry users' personal opinions, which are often unconfirmed or outdated. According to the statistics, around 74% of Python and 88% of SQL
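For illustration, query-code pairs of this kind could be harvested from a Stack Exchange data dump by pairing question titles with the <code> blocks of accepted answers. The sketch below makes simplifying assumptions (file layout, no language filtering or cleaning) that an actual collection process would refine:

```python
# Sketch: extract (question title, accepted-answer code) pairs from a
# Stack Exchange dump's Posts.xml. Assumes questions precede their
# accepted answers in the dump, which are ordered by post Id.
import re
import xml.etree.ElementTree as ET

CODE_RE = re.compile(r'<code>(.*?)</code>', re.DOTALL)

def harvest(posts_xml):
    questions, pairs = {}, []
    for _, row in ET.iterparse(posts_xml, events=('end',)):
        if row.tag != 'row':
            continue
        if row.get('PostTypeId') == '1' and row.get('AcceptedAnswerId'):
            questions[row.get('AcceptedAnswerId')] = row.get('Title')
        elif row.get('PostTypeId') == '2' and row.get('Id') in questions:
            for code in CODE_RE.findall(row.get('Body', '')):
                pairs.append((questions[row.get('Id')], code))
        row.clear()  # keep memory bounded on large dumps
    return pairs

# pairs = harvest('Posts.xml')  # path to a Stack Exchange dump file
```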

Evaluation and results

In this section, we evaluate NJACS through experiments. Specifically, our experiments aim to address the following research questions:

  • RQ1: Does our proposed two-way attention-based neural code search approach for software Q&A sites, NJACS, achieve state-of-the-art performance on all the benchmarks?

  • RQ2: Can NJACS effectively improve code search performance for various programming languages by using structure embeddings instead of other embeddings?

  • RQ3: Whether joint
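The comparisons behind these research questions use the Recall@K and MRR measures mentioned in the abstract. For reference, both metrics reduce to simple statistics over the rank of the correct code for each query (the data below is illustrative):

```python
# Sketch of the evaluation metrics: Recall@K and MRR over 1-based ranks
# of the desired code per query. `ranks` is hypothetical data.
def recall_at_k(ranks, k):
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 2, 1, 10]
print(recall_at_k(ranks, 1), mrr(ranks))  # 0.4 0.5866...
```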

Why does NJACS work?

We have identified three advantages of NJACS that may explain its effectiveness in code search.

Constructing a bi-modal embedding model using joint attention information. Unlike traditional representation-based models such as NCS and DeepCS, NJACS adopts a joint-attention-based neural network rather than RNN-based bi-modal embedding techniques, combining sufficiently detailed matching signals and incorporating code- and query-term importance learning through a two-way attention mechanism. This

Related work

Code search tasks have become increasingly popular in recent years. We survey the work most closely related to ours.

Feature extraction-based code search: Much work studies code search using IR techniques. For instance, McMillan et al. (2011) proposed Portfolio, which extracts relevant functions for a motivating query by utilizing keyword matching and PageRank. Rahman and Roy (2018) proposed RACK to convert queries into lists of API classes for collecting

Conclusion & future work

In this paper, we propose NJACS (Neural Joint Attention Code Search), a novel, scalable and effective code search model for software Q&A sites. NJACS is also suitable for retrieving more general code fragments, such as code elements (e.g., method bodies in C#) and code libraries (e.g., function bodies in Python). NJACS introduces both dedicated code structure embeddings and a joint-attention-based framework to better model the unified vector representations between code fragments and NL

CRediT authorship contribution statement

Gang Hu: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Min Peng: Validation, Formal analysis, Visualization, Software, Supervision, Funding acquisition. Yihan Zhang: Validation, Formal analysis, Visualization. Qianqian Xie: Resources, Writing - review & editing, Supervision, Data curation. Mengting Yuan: Formal analysis, Writing - review & editing, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments on earlier drafts of this paper. The authors gratefully acknowledge the financial support of the National Key R&D Program of China (Grants 2018YFC1604000 and 2018YFC1604003). This work is also supported in part by the Natural Science Foundation of China (Grants 71950002 (Special Program), 61872272 and 61772382 (General Program)).


References (102)

  • Peng, M., et al., 2016. High quality information extraction and query-oriented summarization for automatic query-reply in social network. Expert Syst. Appl. (ESWA).
  • Xu, Y., et al., 2007. A method for speeding up feature extraction based on KPCA. Neurocomputing.
  • Allamanis, M., et al., 2017. Learning to represent programs with graphs.
  • Allamanis, M., et al. A convolutional attention network for extreme summarization of source code.
  • Allamanis, M., et al. Bimodal modelling of source code and natural language.
  • Alon, U., et al., 2019. Code2vec: learning distributed representations of code. Proc. ACM Program. Lang. (POPL).
  • Balaneshin-kordan, S., et al. Embedding-based query expansion for weighted sequential dependence retrieval model.
  • Blei, D.M., et al., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. (JMLR).
  • Bojanowski, P., et al., 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (TACL).
  • Bui, N.D.Q., et al. Hierarchical learning of cross-language mappings through distributed vector representations for code.
  • Cambronero, J., et al. When deep learning met code search.
  • Carpineto, C., et al., 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. (CSUR).
  • Chen, Q., et al. A neural framework for retrieval and summarization of source code.
  • Deerwester, S., et al., 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. Technol. (JASIST).
  • Dempster, A.P., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. (JSTOR).
  • Devlin, J., et al., 2018. BERT: pre-training of deep bidirectional transformers for language understanding.
  • Feng, M., et al. Applying deep learning to answer selection: A study and an open task.
  • Frome, A., et al. DeViSE: A deep visual-semantic embedding model.
  • Gensim. Available:...
  • Github. Available:...
  • Grechanik, M., et al. A search engine for finding highly relevant applications.
  • Gu, X., et al. Deep code search.
  • Gu, X., et al. Deep API learning.
  • Haiduc, S., et al. Automatic query reformulations for text retrieval in software engineering.
  • Hill, E., et al. Improving source code search with natural language phrasal representations of method signatures.
  • Hochreiter, S., et al., 1997. Long short-term memory. Neural Comput.
  • Hoffmann, R., et al. Assieme: finding and leveraging implicit references in a web search interface for programmers.
  • Hu, X., et al. Deep code comment generation.
  • Hu, X., et al., 2019. Deep code comment generation with hybrid lexical and syntactical information. Empir. Softw. Eng. (ESE).
  • Hu, G., et al. Unsupervised software repositories mining and its application to code search.
  • Huo, X., et al. Learning unified features from natural and programming languages for locating buggy source code.
  • Husain, H., et al., 2019. CodeSearchNet challenge: evaluating the state of semantic code search.
  • Inflection. Available:...
  • Iyer, S., et al. Summarizing source code using a neural attention model.
  • Iyer, S., et al., 2018. Mapping language to code in programmatic context.
  • Ke, Y., et al. Repairing programs with semantic code search.
  • Kim, K., et al. FaCoY: a code-to-code search engine.
  • Krugle. Available:...
  • LeClair, A., et al. Adapting neural text classification for improved software categorization.
  • Lemos, O.A.L., et al. Thesaurus-based automatic query expansion for interface-driven code search.
  • Li, X., et al. Relationship-aware code search for JavaScript frameworks.
  • Loper, E., et al., 2002. NLTK: the natural language toolkit.
  • Lu, M., et al. Query expansion via WordNet for effective code search.
  • Lu, J., et al., 2018. Interactive query reformulation for source-code search with word relations. IEEE Access.
  • Luan, S., et al., 2019. Aroma: code recommendation via structural code search. Proc. ACM Program. Lang. (PACMPL).
  • Luong, M.T., et al., 2015. Effective approaches to attention-based neural machine translation.
  • Lv, F., et al. CodeHow: effective code search based on API understanding and extended Boolean model.
  • Maaten, L., et al., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. (JMLR).
  • McMillan, C., et al. Portfolio: finding relevant functions and their usage.
  • Mikolov, T., et al. Distributed representations of words and phrases and their compositionality.
Gang Hu received the MS degree in signal and information processing from Yunnan Minzu University, Kunming, China, in 2016. He is currently working toward the Ph.D. degree at Wuhan University. His current research focus areas include information retrieval and search-based software engineering.

Min Peng received the MS and Ph.D. degrees from Wuhan University, China, in 2002 and 2006. She is currently a professor at the School of Computer Science, Wuhan University. Her current research covers NLP topics such as information retrieval and knowledge graphs. She is a member of the CCF.

Yihan Zhang is currently working toward the BS degree at the National University of Singapore. His research interests include information retrieval and code recommendation.

Qianqian Xie received the BS degree from Jiangxi Normal University, China, in 2016. She is currently working toward the Ph.D. degree in the School of Computer Science at Wuhan University. Her research interests include sparse coding and deep learning.

Mengting Yuan received the MS and Ph.D. degrees from Wuhan University, Wuhan, China, in 2006. He is currently an associate professor at the School of Computer Science, Wuhan University. His current research interests include mining software repositories and deep learning.
