• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-05-10
Davide Fucci; Giuseppe Scanniello; Simone Romano; Natalia Juristo

We present a quasi-experiment to investigate whether, and to what extent, sleep deprivation impacts the performance of novice software developers using the agile practice of test-first development (TFD). We recruited 45 undergraduates and asked them to tackle a programming task. Among the participants, 23 agreed to stay awake the night before carrying out the task, while 22 slept normally. We analyzed the quality (i.e., the functional correctness) of the implementations delivered by the participants in both groups, their engagement in writing source code (i.e., the amount of activities performed in the IDE while tackling the programming task), and their ability to apply TFD (i.e., the extent to which a participant is able to apply this practice). By comparing the two groups of participants, we found that a single night of sleep deprivation leads to a reduction of 50 percent in the quality of the implementations. There is notable evidence that the developers’ engagement and their prowess in applying TFD are negatively impacted. Our results also show that sleep-deprived developers make more fixes to syntactic mistakes in the source code. We conclude that sleep deprivation has possibly disruptive effects on software development activities. The results open opportunities for improving developers’ performance by integrating the study of sleep with other psycho-physiological factors in which the software engineering research community has recently taken an interest.

Updated: 2020-01-10
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-05-15
Mathieu Nassif; Christoph Treude; Martin P. Robillard

Informal language and the absence of a standard taxonomy for software technologies make it difficult to reliably analyze technology trends on discussion forums and other on-line venues. We propose an automated approach called Witt for the categorization of software technologies (an expanded version of the hypernym discovery problem). Witt takes as input a phrase describing a software technology or concept and returns a general category that describes it (e.g., integrated development environment), along with attributes that further qualify it (commercial, php, etc.). By extension, the approach enables the dynamic creation of lists of all technologies of a given type (e.g., web application frameworks). Our approach relies on Stack Overflow and Wikipedia, and involves numerous original domain adaptations and a new solution to the problem of normalizing automatically-detected hypernyms. We compared Witt with six independent taxonomy tools and found that, when applied to software terms, Witt demonstrated better coverage than all evaluated alternative solutions, without a corresponding degradation in false positive rate.

Updated: 2020-01-10
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-05-18
Keheliya Gallaba; Shane McIntosh

Continuous Integration (CI) is a popular practice where software systems are automatically compiled and tested as changes appear in the version control system of a project. Like other software artifacts, CI specifications require maintenance effort. Although there are several service providers like Travis CI offering various CI features, it is unclear which features are being (mis)used. In this paper, we present a study of feature use and misuse in 9,312 open source systems that use Travis CI. Analysis of the features that are adopted by projects reveals that explicit deployment code is rare: 48.16 percent of the studied Travis CI specification code is instead associated with configuring job processing nodes. To analyze feature misuse, we propose Hansel, an anti-pattern detection tool for Travis CI specifications. We define four anti-patterns; Hansel detects anti-patterns in the Travis CI specifications of 894 projects in the corpus (9.60 percent) and achieves a recall of 82.76 percent in a sample of 100 projects. Furthermore, we propose Gretel, an anti-pattern removal tool for Travis CI specifications, which can remove 69.60 percent of the most frequently occurring anti-pattern automatically. Using Gretel, we have produced 36 accepted pull requests that remove Travis CI anti-patterns automatically.

Updated: 2020-01-10
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-06-13
Malvika Rao; David F. Bacon; David C. Parkes; Margo I. Seltzer

An important question in a software economy is how to incentivize deep rather than shallow fixes. A deep fix corrects the root cause of a bug instead of suppressing the symptoms. This paper initiates the study of the problem of incentive design for open workflows in fixing code. We model the dynamics of the software ecosystem and introduce subsumption mechanisms. These mechanisms only make use of externally observable information in determining payments and promote competition between workers. We use a mean field equilibrium methodology to evaluate the performance of these mechanisms, demonstrating in simulation that subsumption mechanisms perform robustly across various environment configurations and satisfy important criteria for market design.

Updated: 2020-01-10
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-06-01
Simone Romano; Christopher Vendome; Giuseppe Scanniello; Denys Poshyvanyk

Dead code is a bad smell and it appears to be widespread in open-source and commercial software systems. Surprisingly, dead code has received very little empirical attention from the software engineering research community. In this paper, we present a multi-study investigation with an overarching goal to study, from the perspective of researchers and developers, when and why developers introduce dead code, how they perceive and cope with it, and whether dead code is harmful. To this end, we conducted semi-structured interviews with software professionals and four experiments at the University of Basilicata and the College of William & Mary. The results suggest that it is worth studying dead code not only in the maintenance and evolution phases, where our results suggest that dead code is harmful, but also in the design and implementation phases. Our results motivate future work to develop techniques for detecting and removing dead code and suggest that developers should avoid this smell.

Updated: 2020-01-10
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-07-25
Christoph Czepa; Uwe Zdun

Temporal properties are important in a wide variety of domains for different purposes. For example, they can be used to avoid architectural drift in software engineering or to support the regulatory compliance of business processes. In this work, we study the understandability of three major temporal property representations: (1) Linear Temporal Logic (LTL) is a formal and well-established logic that offers temporal operators to describe temporal properties; (2) Property Specification Patterns (PSP) are a collection of recurring temporal properties that abstract underlying formal and technical representations; (3) Event Processing Language (EPL) can be used for runtime monitoring of event streams using Complex Event Processing. We conducted two controlled experiments with 216 participants in total to study the understandability of those approaches using a completely randomized design with one alternative per experimental unit. We hypothesized that PSP, as a highly abstracting pattern language, is easier to understand than LTL and EPL, and that EPL, due to separation of concerns (as one or more queries can be used to explicitly define the truth value change that an observed event pattern causes), is easier to understand than LTL. We found evidence supporting our hypotheses which was statistically significant and reproducible.

Updated: 2020-01-10
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-04-16
Sven Amann; Hoan Anh Nguyen; Sarah Nadi; Tien N. Nguyen; Mira Mezini

Application Programming Interfaces (APIs) often have usage constraints, such as restrictions on call order or call conditions. API misuses, i.e., violations of these constraints, may lead to software crashes, bugs, and vulnerabilities. Though researchers developed many API-misuse detectors over the last two decades, recent studies show that API misuses are still prevalent. Therefore, we need to understand the capabilities and limitations of existing detectors in order to advance the state of the art. In this paper, we present the first-ever qualitative and quantitative evaluation that compares static API-misuse detectors along the same dimensions, and with original author validation. To accomplish this, we develop MuC, a classification of API misuses, and MuBenchPipe, an automated benchmark for detector comparison, on top of our misuse dataset, MuBench. Our results show that the capabilities of existing detectors vary greatly and that existing detectors, though capable of detecting misuses, suffer from extremely low precision and recall. A systematic root-cause analysis reveals that, most importantly, detectors need to go beyond the naive assumption that a deviation from the most-frequent usage corresponds to a misuse and need to obtain additional usage examples to train their models. We present possible directions towards more-powerful API-misuse detectors.

Updated: 2020-01-04
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-04-24
Osama Al-Baik; James Miller

In the past decades, software organizations have been relying on implementing process improvement methods to advance quality, productivity, and predictability of their development and maintenance efforts. However, these methods have proven to be challenging to implement in many situations, and when implemented, their benefits are often not sustained. Commonly, the workforce requires guidance during the initial deployment, but what happens after the guidance stops? Why do traditional improvement methods not deliver the desired results? And how do we maintain the improvements when they are realized? In response to these questions, we have combined social and organizational learning methods with Lean's continuous improvement philosophy, Kaizen, which has resulted in an IDKL model that has successfully promoted continuous learning and improvement. The IDKL has evolved through a real-life project with an industrial partner; the study employed ethnographic action research with 231 participants and lasted for almost 3 years. The IDKL requires employees to continuously apply small improvements to the daily routines of the work procedures. The small improvements by themselves are unobtrusive. However, the IDKL has helped the industrial partner to implant continuous improvement as a daily habit. This has led to realizing sustainable and noticeable improvements. The findings show that, on average, Lead Time has dropped by 46 percent, Process Cycle Efficiency has increased by 137 percent, First-Pass Process Yield has increased by 27 percent, and Customer Satisfaction has increased by 25 percent.

Updated: 2020-01-04
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-04-30
Meng Yan; Xin Xia; Emad Shihab; David Lo; Jianwei Yin; Xiaohu Yang

Technical debt (TD) is a metaphor to describe the situation where developers introduce suboptimal solutions during software development to achieve short-term goals that may affect the long-term software quality. Prior studies proposed different techniques to identify TD, such as identifying TD through code smells or by analyzing source code comments. Technical debt identified using comments is known as Self-Admitted Technical Debt (SATD) and refers to TD that is introduced intentionally. Compared with TD identified by code metrics or code smells, SATD is more reliable since it is admitted by developers using comments. Thus far, all of the state-of-the-art approaches identify SATD at the file-level. In essence, they identify whether a file has SATD or not. However, all of the SATD is introduced through software changes. Previous studies that identify SATD at the file-level in isolation cannot describe the TD context related to multiple files. Therefore, it is beneficial to identify the SATD once a change is being made. We refer to this type of TD identification as “Change-level SATD Determination”, which determines whether or not a change introduces SATD. Identifying SATD at the change-level can help to manage and control TD by understanding the TD context through tracing the introducing changes. To build a change-level SATD Determination model, we first identify TD from source code comments in source code files of all versions. Second, we label the changes that first introduce the SATD comments as TD-introducing changes. Third, we build the determination model by extracting 25 features from software changes that are divided into three dimensions, namely diffusion, history and message, respectively. To evaluate the effectiveness of our proposed model, we perform an empirical study on 7 open source projects containing a total of 100,011 software changes. 
The experimental results show that our model achieves a promising and better performance than four baselines in terms of AUC and cost-effectiveness (i.e., percentage of TD-introducing changes identified when inspecting 20 percent of changed LOC). On average across the 7 experimental projects, our model achieves AUC of 0.82, cost-effectiveness of 0.80, which is a significant improvement over the comparison baselines used. In addition, we found that “Diffusion” is the most discriminative dimension among the three dimensions of features for determining TD-introducing changes.
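
The cost-effectiveness measure mentioned above (the share of TD-introducing changes found when inspecting 20 percent of changed LOC) can be illustrated with a minimal sketch. It assumes changes are inspected in descending order of predicted score until the LOC budget is used up; the function name and the `(score, loc, is_td)` tuple layout are hypothetical and are not taken from the paper.

```python
def cost_effectiveness(changes, budget=0.20):
    """Fraction of TD-introducing changes found within a LOC budget.

    changes: list of (score, loc, is_td) tuples (hypothetical layout), where
    score is the model's predicted risk, loc the changed lines of code, and
    is_td whether the change actually introduces SATD.
    """
    total_loc = sum(loc for _, loc, _ in changes)
    total_td = sum(1 for _, _, is_td in changes if is_td)
    if total_loc == 0 or total_td == 0:
        return 0.0

    limit = budget * total_loc
    inspected = 0
    found = 0
    # Assumption: inspect the highest-scored changes first and stop as soon
    # as the next change would exceed the LOC budget.
    for _, loc, is_td in sorted(changes, key=lambda c: c[0], reverse=True):
        if inspected + loc > limit:
            break
        inspected += loc
        if is_td:
            found += 1
    return found / total_td
```

Under this reading, a value of 0.80 would mean that 80 percent of the TD-introducing changes fall within the first 20 percent of changed LOC in the ranked list.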

Updated: 2020-01-04
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-05-08
Gerardo Canfora; Fabio Martinelli; Francesco Mercaldo; Vittoria Nardone; Antonella Santone; Corrado Aaron Visaggio

With the increasing diffusion of mobile technologies, mobile devices have become an irreplaceable tool for performing many operations, from posting a status on a social network to transferring money between bank accounts. As a consequence, mobile devices store a huge amount of private and sensitive information, which is why attackers are developing very sophisticated techniques to extort data and money from our devices. This paper presents the design and the implementation of LEILA (formaL tool for idEntifying mobIle maLicious behAviour), a tool targeted at detecting Android malware families. LEILA is based on a novel approach that exploits model checking to analyse and verify the Java Bytecode that is produced when the source code is compiled. After a thorough description of the method used for detecting Android malware families, we report the experiments we have conducted using LEILA. The experiments demonstrated that the tool is effective in detecting malicious behaviour and, especially, in localizing the payload within the code: we evaluated real-world malware belonging to several widespread families, obtaining an accuracy ranging between 0.97 and 1.

Updated: 2020-01-04
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-05-15
Qinbao Song; Yuchen Guo; Martin Shepperd

Context: Software defect prediction (SDP) is an important challenge in the field of software engineering, hence much research work has been conducted, most notably through the use of machine learning algorithms. However, class-imbalance typified by few defective components and many non-defective ones is a common occurrence causing difficulties for these methods. Imbalanced learning aims to deal with this problem and has recently been deployed by some researchers, unfortunately with inconsistent results. Objective: We conduct a comprehensive experiment to explore (a) the basic characteristics of this problem; (b) the effect of imbalanced learning and its interactions with (i) data imbalance, (ii) type of classifier, (iii) input metrics and (iv) imbalanced learning method. Method: We systematically evaluate 27 data sets, 7 classifiers, 7 types of input metrics and 17 imbalanced learning methods (including doing nothing) using an experimental design that enables exploration of interactions between these factors and individual imbalanced learning algorithms. This yields 27 × 7 × 7 × 17 = 22491 results. The Matthews correlation coefficient (MCC) is used as an unbiased performance measure (unlike the more widely used F1 and AUC measures). Results: (a) we found a large majority (87 percent) of 106 public domain data sets exhibit a moderate or low level of imbalance (imbalance ratio < 10; median = 3.94); (b) anything other than low levels of imbalance clearly harms the performance of traditional learning for SDP; (c) imbalanced learning is more effective on the data sets with moderate or higher imbalance, however negative results are always possible; (d) the type of classifier has the most impact on the improvement in classification performance, followed by the imbalanced learning method itself, while the type of input metrics is not influential; (e) only ∼52% of the combinations of Imbalanced Learner and Classifier have a significant positive effect.
Conclusion: This paper offers two practical guidelines. First, imbalanced learning should only be considered for moderate or highly imbalanced SDP data sets. Second, the appropriate combination of imbalanced method and classifier needs to be carefully chosen to ameliorate the imbalanced learning problem for SDP. In contrast, the indiscriminate application of imbalanced learning can be harmful.
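
For readers unfamiliar with the Matthews correlation coefficient named in the Method above, the following is a minimal sketch of its standard definition from a binary confusion matrix. Returning 0.0 when the denominator vanishes is a common convention assumed here, not something the abstract states; the last line simply re-checks the size of the experimental grid.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Size of the full factorial grid described in the abstract:
# data sets x classifiers x input metric types x imbalanced learning methods.
n_results = 27 * 7 * 7 * 17
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it uses all four cells of the confusion matrix, which is why it is less sensitive to class imbalance than F1.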

Updated: 2020-01-04
• IEEE Trans. Softw. Eng. (IF 4.778) Pub Date : 2018-05-17
Hanefi Mercan; Cemal Yilmaz; Kamer Kaya

We present a configurable, hybrid, and parallel covering array constructor, called CHiP. CHiP is parallel in that it utilizes the vast amount of parallelism provided by graphics processing units (GPUs). CHiP is hybrid in that it bundles the best of two construction approaches for computing covering arrays: a metaheuristic search-based approach for efficiently covering a large portion of the required combinations and a constraint satisfaction-based approach for effectively covering the remaining hard-to-cover-by-chance combinations. CHiP is configurable in that a trade-off between covering array sizes and construction times can be made. We have conducted a series of experiments, in which we compared the efficiency and effectiveness of CHiP to those of a number of existing constructors by using both full factorial designs and well-known benchmarks. In these experiments, we report new upper bounds on covering array sizes, demonstrating the effectiveness of CHiP, and the first results for a higher coverage strength, demonstrating the scalability of CHiP.
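
As background, the property that any covering array constructor must establish can be checked with a few lines of code. The sketch below is not CHiP's algorithm; it is a brute-force verifier, under the assumptions that parameter values are encoded as integers `0..levels[i]-1` and that there are no constraints between parameters. The function name is hypothetical.

```python
from itertools import combinations, product


def is_covering_array(rows, levels, t):
    """Check that every t-way value combination appears in at least one row.

    rows:   list of tuples, one value per parameter
    levels: number of values per parameter (values assumed to be 0..n-1)
    t:      coverage strength
    """
    k = len(levels)
    for cols in combinations(range(k), t):
        # All value combinations that must be covered for these t columns.
        required = set(product(*(range(levels[c]) for c in cols)))
        # Combinations actually present in the array for these columns.
        seen = {tuple(row[c] for c in cols) for row in rows}
        if not required <= seen:
            return False
    return True


# Four rows cover all pairs of three binary parameters (strength 2),
# whereas exhaustive testing would need eight rows.
pairwise = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

The example shows why smaller arrays matter: the same four rows satisfy strength 2 but not strength 3, so raising the coverage strength forces the array to grow.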

Updated: 2020-01-04
Contents have been reproduced by permission of the publishers.
