1 Introduction

Quality assurance with respect to both functional and non-functional quality characteristics of software is crucial to the success of software products. For example, an extra one-second delay in the load time of a storefront page can cause an 11% reduction in page views and a 16% drop in customer satisfaction (NS8 2018). Moreover, mission-critical systems such as banking, retailing, and airline reservation systems are all required to be resilient against varying conditions affecting their functional performance (Weyuker and Vokolos 2000; Brunnert et al. 2015; Grinshpan 2012).

Performance, which is also called "efficiency" in some classification schemes of quality characteristics (ISO25000 2019; Glinz 2007; Chung et al. 2012), generally refers to how well a software system (service) accomplishes its expected functionality. Performance requirements mainly describe time and resource constraints on the behavior of software, which are often expressed in terms of performance metrics such as response time, throughput, and resource utilization.

Performance evaluation. Performance modeling and testing are common evaluation approaches used to accomplish objectives such as measuring performance metrics, detecting functional problems that emerge under certain performance conditions, and detecting violations of performance requirements (Jiang and Hassan 2015). Performance modeling mainly involves building a model of the software system's behavior using modeling notations such as queueing networks, Markov processes, Petri nets, and simulation models (Cortellessa et al. 2011; Harchol-Balter 2013; Kant and Srinivasan 1992). Although models provide helpful insights into the performance behavior of the system, many details of the implementation and the execution platform might be abstracted away in the modeling (Denaro et al. 2004). Moreover, drawing a precise model expressing the performance behavior of the software under different conditions is often difficult. Performance testing, as another family of techniques, is intended to achieve the aforementioned objectives by executing the software under actual execution conditions.

Verifying the robustness of the system in terms of finding the performance breaking point is one of the primary purposes of performance testing. A performance breaking point refers to the status of the software at which the system becomes unresponsive or certain performance requirements are violated.

Research challenge. Performance testing to find performance breaking points remains a challenge for complex software and execution platforms. The main issue for testing approaches is the automated and efficient generation of test cases (test conditions) that accomplish the intended objective. Common approaches for generating performance test cases, such as source code analysis (Zhang et al. 2012), linear programming and evolutionary algorithms applied to performance models (Zhang and Cheung 2002; Gu and Ge 2009; Di Penta et al. 2007) and UML models (Garousi 2010; Garousi 2008; Garousi et al. 2008; Costa et al. 2012; da Silveira et al. 2011), use-case-based techniques (Draheim et al. 2006; Lutteroth and Weber 2008), and behavior-driven techniques (Schulz et al. 2019; Ferme and Pautasso 2018; Ferme and Pautasso 2017; Walter et al. 2016), mainly rely on source code or other artifacts, which might not always be available during testing.

We propose that machine learning techniques can tackle the aforementioned issues. One category of machine learning algorithms is reinforcement learning (RL), which trains an agent (learner) to solve a problem in an environment by rewarding or punishing it in a trial-and-error interaction with that environment. Model-free RL is a subset of RL that enables the learner to explore the environment (in our case, the behavior of the software under test (SUT) in an execution environment) and learn the optimal policy to accomplish the objective (in our case, generating performance test cases that reach an intended performance breaking point) without access to source code or a model of the system. The learner can store the learned policy and replay it in future situations, which can lead to efficiency improvements.

 Goal of the paper. Our research goal is represented by the following question:

How can we adaptively and efficiently generate the performance test cases resulting in the performance breaking points for different software programs without access to the underlying source code and performance models?

Finding the performance breaking point is a key purpose of robustness analysis, which is of great importance for many types of software systems, particularly in mission- and safety-critical domains (Fowler 2009). Moreover, the question above is also worth exploring in specific applications such as resource management (scaling, provisioning, and scheduling) for cloud services (Jennings and Stadler 2015), performance prediction (Venkataraman et al. 2016; Kolesnikov et al. 2019), and performance analysis of software services in other areas (Morabito 2017; Babovic et al. 2016).

Contribution. In this paper, we present the design and experimental evaluation of a self-adaptive fuzzy reinforcement learning-based (SaFReL) performance testing framework. It is intended to efficiently and adaptively generate the (platform-based) performance test conditions leading to the performance breaking point for different software programs with different performance sensitivities to resources (e.g., CPU-, memory-, and disk-intensive programs) without access to source code or performance models. An early-stage general formulation of the idea of using RL in performance testing was introduced in our prior work (Moghadam et al. 2019). That initial formulation introduces, at an abstract level, a single smart tester agent that uses RL (simple Q-learning) in a two-phase learning process, together with an initial architecture. This paper extends that abstract formulation of RL-assisted performance testing (Moghadam et al. 2019). It uses an elaborate learning technique originally inspired by the conference paper by Ibidunmoye et al. (2017), which presents an adaptive performance (response time) control approach for cloud services using cooperative fuzzy multi-agent reinforcement learning. Regarding the distinguishing learning details, however, the proposed RL-assisted performance testing framework is based on a single smart agent, involves two distinct phases of learning, and benefits from a particular adaptive learning strategy that plays an important role in the functionality of the agent. The proposed smart performance testing framework conducts performance testing to meet a testing objective, namely finding an intended performance breaking point. The proposed framework, SaFReL, is a two-phase RL-assisted performance testing agent that is able to learn to generate performance test cases efficiently to meet the testing objective and, more importantly, to replay the learned policy in further similar testing situations.

SaFReL assumes two phases of learning: initial and transfer learning. In the initial learning phase, it learns the optimal policy to generate the target performance test cases upon observing the behavior of the first SUT. Afterward, in the transfer learning phase, it reuses the learned policy for SUTs with a performance sensitivity analogous to already observed ones, while still keeping the learning running in the long term. The learning mechanism uses Q-learning augmented by fuzzy logic in one part of the learning to deal with the uncertainty of defining discrete categories over continuous values, as used by Ibidunmoye et al. (2017). The single lightweight RL tester agent has the capability of transfer learning and reusing knowledge in similar situations. It benefits from an adaptive action selection strategy that adapts the learning to various testing situations and subsequently enables the agent to act efficiently on various SUTs.

We demonstrate that SaFReL works adaptively and efficiently on different sets of SUTs, which are either homogeneous or heterogeneous in terms of their performance sensitivity. Our experiments are based on simulating the performance behavior of 50 instances of 12 well-known programs as the SUTs. These instances are characterized by various initial amounts of granted resources and different values of response time requirements. We use two evaluation criteria, namely efficiency and adaptivity, to evaluate our approach. We investigate the efficiency of the approach in generating the test cases that reach the intended performance breaking point, as well as the behavioral sensitivity of the approach to the learning parameters. In particular, SaFReL reaches the intended objective more efficiently than a typical stress testing technique, which generates performance test cases by changing the test conditions, e.g., decreasing the availability of resources, in fixed steps in an exploratory way. SaFReL reduces the cost (in terms of computation time) of performance test case generation by reusing the learned policy on SUTs with similar performance sensitivity. Moreover, it adapts its operational strategy effectively to various SUTs with different performance sensitivities while preserving efficiency. To summarize, our contributions in this paper are:

  • A smart performance testing framework (agent) that learns the optimal policy (way) to generate performance test cases meeting the testing objective without access to source code or models, and reuses the learned policy in further testing cases. It uses fuzzy RL and an adaptive action selection strategy for the generation of test cases and implements two phases of learning:

    • Initial learning during which the agent learns the optimal policy for the first time,

    • Transfer learning during which the agent replays the learned policy in similar cases while keeping the learning running in the long term.

  • A twofold experimental evaluation involving performance (efficiency and adaptivity) and sensitivity analysis of the approach. The evaluation is carried out by simulating the performance behavior of various SUTs. We use a performance simulation module instead of actually executing SUTs; its main function is to estimate the performance behavior of SUTs in terms of their response time.

Structure of the paper. The rest of the paper is organized as follows: Section 2 discusses the background concepts and the motivations for the proposed self-adaptive learning-based approach. Section 3 presents an overview of the architecture of the proposed testing framework, while the technical details of its constituent parts are described in Sections 4 and 5. In Section 6, we explain the functions of the learning phases. Section 7 reports on the experimental evaluation, including the experimental setup and the results of the experimentation. Section 8 discusses the results, the lessons learned during the experimentation, and the threats to the validity of the results. Section 9 provides a review of the related work, and finally, Section 10 concludes the paper and discusses some future directions.

2 Motivation and background

Performance analysis, realized through modeling or testing, is important for performance-critical software systems in various domains. Anomalies in the performance behavior of a software system or violations of performance requirements are generally consequences of performance bottlenecks emerging at the system or platform level (Ibidunmoye et al. 2015; Chandola et al. 2009). A performance bottleneck is a system or resource component that limits the performance of the system and hinders it from acting as required (Gregg 2013). The behavior of a bottleneck component is caused by limitations associated with the component, such as saturation and contention. Saturation of a system or resource component happens when its capacity is fully utilized or the utilization exceeds a usage threshold (Gregg 2013). Capacity expresses the maximum available processing power, service rate, or storage size. Contention occurs when multiple processes contend for access to a limited number of shared components, such as resource components (e.g., CPU cycles, memory, and disk) or software (application) components.

There are various application-, platform-, and workload-based causes for the emergence of performance bottlenecks (Ibidunmoye et al. 2015). Application-based causes represent issues such as defects in the source code or faults in the system architecture. Platform-based causes characterize issues related to hardware resources, the operating system, and the execution platform. Workload-based causes denote high deviations from the expected workload intensity and similar issues such as workload burstiness.

On the other hand, detecting violations of performance requirements and finding performance breaking points are challenging, particularly for complex software systems. To address these challenges, we need to determine how to provide the critical execution conditions that make performance bottlenecks emerge. The focus of performance testing in our case is to assess the robustness of the system and find the performance breaking point.

The effects of the internal causes (application/architecture-based ones) can vary, e.g., due to continuous changes and updates of the software during continuous integration/continuous delivery (CI/CD), and can even vary across execution platforms and workload conditions. Therefore, the complexity of the SUT and the variety of factors at play make it hard to build a precise performance model expressing the effects of all of them. This is a major barrier that motivates the use of model-free learning-based approaches such as model-free RL, in which the optimal policy for accomplishing the objective is learned indirectly through interaction with the environment (the SUT and the execution platform). In our problem, the testing system learns the optimal policy to achieve the target, i.e., finding an intended performance breaking point, for different types of software without access to a model of the environment. The testing system explores the behavior of the SUT by varying the platform-based (and, in future work, workload-based) test conditions, stores the learned policy, and is able to reuse it later in similar situations, i.e., on other SUTs with similar performance sensitivity to resource restrictions. This feature of the proposed learning approach is expected to lead to a considerable reduction in the testing effort and, subsequently, to savings in computation time.

Given the aforementioned challenges and the strengths of the model-free learning-based approach, we hypothesize that in a CI/CD process based on agile software development, performance engineers and testers can save time and resources by using SaFReL for performance (stress) testing of various releases or variants. SaFReL provides an agile, efficient performance test case generation technique (see Sections 7 and 8 for the efficiency evaluation) while eliminating the need for source code or system model analysis.

2.1 Reinforcement learning

Reinforcement learning (RL) (Sutton and Barto 2018) is a fundamental category of machine learning algorithms generally intended to find the optimal behavior in decision-making problems. RL is an interactive learning paradigm that differs from the common supervised and unsupervised machine learning algorithms and has frequently been applied to building self-adaptive smart systems. It involves continuous interaction between the agent (learner) and the environment being controlled. At each step of the interaction, the agent observes (senses) the state of the environment, takes a possible action, and receives a reinforcement signal as a scalar reward from the environment, which shows the effectiveness of the applied action in guiding the agent toward the intended objective. There is no supervisor in RL; the agent receives only a reward signal. RL is basically a sequential decision-making process: the agent goes through the environment, decides how to behave at each step, and learns the optimal way of decision making by optimizing the long-term received reward.

The agent decides between actions based on the history of its observations. However, considering the whole history of observations is not efficient; therefore, the state should be formulated as a concise summary of the history that includes all the required information. A helpful concept for formulating the state as such a summary function is the Markov state. The states of the environment are Markov by definition. Thus, when the environment is fully observable to the agent, the states that the agent observes and uses for making decisions are Markov too. The environment in our case is the SUT and the execution platform. The state is modeled in terms of response time and resource utilization improvement. The actions are operations that modify/adjust the available capacity of resources, and the objective of the agent is finding an intended performance breaking point. Figure 1 shows the interaction between the agent and the environment, which in our case is the composition of the SUT and the execution platform.

There are three main elements in an RL agent: the policy, the value function, and the model. The policy is the behavior function describing which actions the agent takes in a certain state. The value function indicates how good each state and/or action is, in terms of the amount of reward expected upon taking a particular action in a particular state. Finally, the model is the agent's view of the environment and describes what the environment does next, e.g., the state transitions of the environment.

Model-free RL algorithms are special types of RL that are not intended to build or learn a model of the environment. Instead, they learn the optimal behavior to achieve the intended objective through multiple experiences of interaction with the environment. Temporal difference (TD) (Sutton and Barto 2018) is one of the main types of model-free RL, which is able to learn from incomplete episodes of interaction with the environment. Q-learning, as a model-free TD method, learns the optimal policy through learning the optimal value function, i.e., the Q-values. It uses an action selection strategy based on a combination of trying out the available actions, namely exploration, and relying on the previously achieved experience to select highly valued actions, namely exploitation. It is off-policy, which means that the agent learns the optimal policy regardless of how it explores the environment. After learning the optimal policy, in the transfer learning phase, the agent is able to replay the learned policy while keeping the learning running, which implies occasionally exploring the action space and trying out different actions.
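To make the exploration-exploitation trade-off concrete, the following minimal Python sketch shows an \(\varepsilon\)-greedy action selection function. It is a generic illustration rather than SaFReL's implementation; the q_values dictionary is an assumed placeholder for the learned action values.

```python
import random

def epsilon_greedy(q_values, actions, epsilon):
    """Pick a random action with probability epsilon (exploration);
    otherwise pick the action with the highest learned value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))
```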

Fig. 1 Interaction between agent and SUT in RL

3 Architecture

This section provides an overview of the architecture of the proposed smart performance testing framework, SaFReL (see Fig. 2). The entire interaction of the smart framework with each SUT, as a learning episode, consists of a number of learning trials. The steps of learning in each trial and the components involved in each step are described as follows:

  1. Fuzzy State Detection. The fuzzification, fuzzy inference, and rule base components in Fig. 2 are involved in state detection. The agent uses the values of four quality metrics, 1) response time and the utilization improvements of 2) CPU, 3) memory, and 4) disk, to identify the state of the environment. In other words, the state expresses the status of the environment relative to the testing target. In our case, these quality metrics are used to model (represent) the state space of the environment. An ordinary approach for state modeling in RL problems is dividing the state space into multiple mutually exclusive discrete sets, where each set represents a discrete state and the environment is in exactly one state at each time. The challenges of such a crisp categorization, i.e., defining discrete states, include deciding which value is a suitable threshold between the categories of a metric and how to treat values on the boundary between categories. Instead of crisp discrete states, using fuzzy logic and defining fuzzy states helps address these challenges. We use fuzzy classification as a soft labeling technique for presenting the values of the metrics used for modeling the state of the environment. Then, using a fuzzy inference engine and a fuzzy rule base, the agent detects the fuzzy state of the environment. More details about the fuzzy state detection of the agent are presented in Section 4.

  2. Action Selection and Strategy Adaptation. After detecting the fuzzy state of the SUT, the agent takes an action. In the current prototype, the actions are operations modifying the factors affecting the performance, i.e., the available resource capacity. The agent selects the action according to the action selection strategy that it follows. The action selection strategy determines to what extent the agent should explore and try out the available actions, and to what extent it should rely on the learned policy and select a high-value action that has been tried and assessed before. This strategy guides the action selection of the agent throughout the learning and is important for the efficiency of the learning. To obtain the desired efficiency, a proper trade-off between exploration of the state-action space and exploitation of the previously learned policy is critical. In our proposed framework, the smart agent is augmented with a strategy adaptation characteristic, a meta-learning feature responsible for dynamically adapting the degree of exploration and exploitation in various situations. This feature enables SaFReL to detect where it should rely on the previously learned policy and where it should change the strategy to update its policy and adapt to new situations. New situations mean acting on new SUTs that differ from the previously observed ones in terms of performance sensitivity to resources. Software programs have different levels of sensitivity to resources; SUTs with different performance sensitivities, e.g., CPU-intensive, memory-intensive, or disk-intensive SUTs, react differently to changes in resource availability. Therefore, when the agent observes a SUT that differs from the previously observed ones in terms of performance sensitivity, the strategy adaptation guides the agent toward doing more exploration than exploitation. A performance sensitivity indicator showing the sensitivity of the SUT to the resources (i.e., being CPU-intensive, memory-intensive, or disk-intensive) is an input to the strategy adaptation mechanism (see Fig. 2). The components corresponding to the action selection, the stored experience (learned policy), and the strategy adaptation are shown as yellow components in Fig. 2. More details about the set of actions and the mechanism of strategy adaptation are described in Section 5.

  3. Reward Computation. After taking the selected action, the agent receives a reward signal indicating the effectiveness of the applied action in approaching the intended performance breaking point. The reward computation component (red block in Fig. 2) calculates the received reward for the taken action (see Section 5).

Fig. 2 SaFReL architecture

4 Fuzzy state detection

The state space of the environment in our learning problem is modeled by the quality measurements, i.e., the CPU, memory, and disk utilization improvements and the response time of the SUT, whose fuzzy representation is shown in Fig. 3. The learning approach works based on detecting (discrete) states of the system. These states would typically be defined by classifying the continuous values of the aforementioned quality measurements. However, defining such crisp boundaries on a number of continuous domains involves many uncertainties. To address this issue and preserve the desired precision of the model, fuzzy classification and reasoning are used to specify the states of the system. Therefore, the states of the environment are defined as fuzzy states, and the environment can be in one or more fuzzy states at the same time with different degrees of certainty. The agent detects the state of the system using a fuzzy inference engine and a rule base (Kuncheva 2008; MathWorks 2019) (Fig. 2). In summary, state detection is performed by making fuzzy inference about the state of the system. The fuzzy state detection consists of three main parts: normalization of the input values (quality measurements), fuzzification of the measurements, and fuzzy inference to identify the state of the environment. The details of these parts, together with the fuzzy rules, fuzzy operators, and the implication method used, are described in Section 4.1.

4.1 State modeling and fuzzy inference

Normalization. As described in the previous section, a set of quality measurements, the CPU, memory, and disk utilization improvements and the response time of the SUT, represents the state of the environment. The values of these measurements are not bounded; therefore, to simplify the inference and the exploration of the state space, we normalize them to the interval [0, 1] using the following functions:

$$\begin{aligned} {RT_n}&= \frac{2}{\pi }\tan ^{-1}\left( \frac{RT_n^\prime }{RT^q}\right) \end{aligned}$$
(1)
$$\begin{aligned} {CUI}_n&= \frac{1}{CUI_n^\prime } \quad MUI_n=\frac{1}{MUI_n^\prime } \quad DUI_n=\frac{1}{DUI_n^\prime } \end{aligned}$$
(2)

where \(RT_n^\prime\), \(CUI_n^\prime\), \(MUI_n^\prime\), and \(DUI_n^\prime\) are the measured values of the response time and the CPU, memory, and disk utilization improvements at time step \(n\), respectively, and \(RT^q\) is the response time requirement. \(CUI_n^\prime\), the CPU utilization improvement, is the ratio between the CPU utilization at time step \(n\) and its initial value (at the start of learning), that is, \({CUI_n^\prime }=\frac{CU_n}{CU^i}\). Likewise, \({MUI_n^\prime }=\frac{MU_n}{MU^i}\) and \({DUI_n^\prime }=\frac{DU_n}{DU^i}\). With the normalization function in Eq. 1, when \({RT_n^\prime }=RT^q\) the normalized response time \(RT_n\) is 0.5; for \({RT_n^\prime }> RT^q\) the normalized values tend toward 1, and for \({RT_n^\prime }< RT^q\) they tend toward 0. The tuple \((CUI_n, MUI_n, DUI_n, RT_n)\) of normalized quality measurements is the input to the fuzzy state detection.
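A minimal Python sketch of these normalization functions (Eqs. 1 and 2), assuming the raw measurements and the response time requirement are available as plain numbers; the function names are illustrative:

```python
import math

def normalize_response_time(rt, rt_req):
    """Eq. 1: maps the measured response time RT'_n to (0, 1);
    returns exactly 0.5 when RT'_n equals the requirement RT^q."""
    return (2.0 / math.pi) * math.atan(rt / rt_req)

def normalize_utilization_improvement(current_util, initial_util):
    """Eq. 2: reciprocal of the utilization improvement ratio,
    e.g., CUI_n = 1 / (CU_n / CU^i); stays in [0, 1] as long as the
    utilization does not drop below its initial value."""
    return 1.0 / (current_util / initial_util)

# The tuple (CUI_n, MUI_n, DUI_n, RT_n) built from these values is the
# input to the fuzzy state detection.
```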

Fuzzification. Input fuzzification involves defining fuzzy sets and corresponding membership functions over the values of the quality measurements. A membership function is characterized by a linguistic term. A fuzzy set L is defined as \(L=\{(x, \mu _L(x))\ |\ 0<x,\ x\in \mathbb {R}\}\), where the membership function \(\mu _L(x)\) defines the membership degree of each value, \(\mu _L:x\rightarrow [0,1]\). Figure 3 shows the membership functions defined over the value domains of the quality measurements. As shown in Fig. 3, trapezoidal membership functions are used for the High and Low fuzzy sets and a triangular one for the Normal fuzzy set over the response time. In Fig. 3, where \(RT^q\) is the requirement, the Normal (medium) fuzzy set over the response time covers a small range around the requirement value, representing normal response time values. The ranges of the membership functions were selected empirically and can be updated based on the requirements.

Fuzzy Inference. After input fuzzification, inferring the possible states that the environment assumes is directed by the fuzzy rules, which are formed based on domain knowledge.

Fuzzy Rules. A fuzzy rule, as shown in Eq. 3, consists of two parts: an antecedent and a consequent. The former is a combination of linguistic terms of the input normalized quality measurements, and the consequent is a fuzzy set with a membership function showing to what extent the environment is in the associated state.

$$\begin{aligned} \text {Rule 1: }&\text {If CUI is High AND MUI is High AND DUI is Low AND}\nonumber \\&\text {RT is Normal, then State is HHLN}. \end{aligned}$$
(3)

Rule 1 is a sample of the fuzzy rules in the rule base. The rest of the rules are defined similarly, based on the fuzzy sets defined over the values of the quality measurements and their combinations. Given two fuzzy sets, High and Low, over the value range of each resource utilization improvement and three sets, High, Normal, and Low, over the value range of the response time, we define 24 (i.e., 2 × 2 × 2 × 3) rules in our rule base to describe the fuzzy states of the environment.

Fuzzy Operators. When the antecedent of a rule is made of multiple linguistic terms associated with fuzzy sets, e.g., "High, High, Low, and Normal", fuzzy operators are applied to the antecedent to obtain one number showing the support or activation degree of the rule. Two well-known methods for the fuzzy \(AND\) operator are \(minimum\ (min)\) and \(product\ (prod)\). In our case, we use the \(min\) method for the fuzzy \(AND\) operation. Given a set of input parameters \(A\), the degree of support for rule \(Ri\) is \(\tau _{Ri}=\min \limits _j \mu _L(a_j)\), where \(a_j\) is an input parameter in \(A\) and \(L\) is its associated fuzzy set in rule \(Ri\).

Implication Method. After obtaining the membership degree of the antecedent, the membership function of the consequent is reshaped using an implication method. There are two well-known implication methods, \(minimum\ (min)\) and \(product\ (prod)\), which truncate and scale the membership function of the output fuzzy set, respectively. The membership degree of the antecedent is given as input to the implication method. We use the \(min\) method as the implication method in our case.

Finally, the most effective rule, i.e., the one with the maximum support degree, is selected to determine the final fuzzy state of the environment, \((S_n,\mu _n)\). In summary, the fuzzy state with the highest likelihood is considered the state of the system. Figure 4 shows a representation of the fuzzy states. Each of them represents one state based on the fuzzy values (linguistic terms) assigned to the quality measurements (CPU, memory, and disk utilization improvement and response time). In this representation, L, H, and N stand for the Low, High, and Normal terms, respectively.
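The following Python sketch illustrates the flavor of this inference step: trapezoidal/triangular membership functions, the min operator as fuzzy AND over a rule's antecedent, and selection of the most activated rule. The break points of the membership functions and the single listed rule are hypothetical placeholders; the actual ranges and the full 24-rule base follow Fig. 3 and the rule scheme described above.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical fuzzy sets over the normalized [0, 1] domains (break points are
# illustrative; the paper's actual ranges are set empirically, cf. Fig. 3).
def low(x):       return trapezoid(x, -0.1, 0.0, 0.3, 0.5)
def high(x):      return trapezoid(x, 0.5, 0.7, 1.0, 1.1)
def normal_rt(x): return trapezoid(x, 0.4, 0.5, 0.5, 0.6)  # triangular around RT_n = 0.5

# Each rule maps linguistic terms for (CUI, MUI, DUI, RT) to a state label,
# e.g., Rule 1 (Eq. 3) is "HHLN"; the remaining 23 rules are formed analogously.
RULES = [((high, high, low, normal_rt), "HHLN")]

def detect_state(cui, mui, dui, rt):
    """Return (state, membership degree) of the most activated rule, using min as fuzzy AND."""
    best_state, best_degree = None, 0.0
    for (f_cui, f_mui, f_dui, f_rt), state in RULES:
        degree = min(f_cui(cui), f_mui(mui), f_dui(dui), f_rt(rt))
        if degree > best_degree:
            best_state, best_degree = state, degree
    return best_state, best_degree
```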

Fig. 3 Fuzzy representation of quality measurements

Fig. 4 Fuzzy states of the environment

5 Adaptive action selection and reward computation

Actions. In SaFReL, the actions are operations changing the platform-based factors affecting the performance, i.e., the available resources such as computation (CPU), memory, and disk capacity. In the current prototype, the set of actions contains operations reducing the available resource capacity in fine-tuned steps, as follows:

$$\begin{aligned} AC_n= & {} \{\text {no action}\}\ \cup \ \{(CPU_n-y)\ |\ y \in CDF\}\ \cup \ \{(Mem_n-k)\ |\ k \in MDF_n \}\nonumber \\&\cup \ \{(Disk_n-k)\ |\ k \in MDF_n\}\end{aligned}$$
(4)
$$\begin{aligned} CDF= & {} \{\frac{1}{4},\frac{2}{4},\frac{3}{4},1\}\end{aligned}$$
(5)
$$\begin{aligned} MDF_n= & {} \{(x\times \frac{Mem(Disk)_n}{4})\ |\ x \in \{\frac{1}{4},\frac{2}{4},\frac{3}{4},1\}\} \end{aligned}$$
(6)

where \(AC_n\), \(CPU_n\), \(Mem_n\), and \(Disk_n\) represent the set of actions and the currently available computation (CPU), memory, and disk capacity at time step n, respectively. The list of actions is shown in Table 1.
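As an illustration, the action set of Eqs. 4-6 can be enumerated as follows; the sketch assumes CPU is measured in cores and memory/disk in GB, and represents each action as a (resource, reduction amount) pair rather than SaFReL's internal encoding.

```python
def build_action_set(cpu, mem, disk):
    """Enumerate the resource-reduction actions of Eqs. 4-6 for the current capacities."""
    fractions = [1/4, 2/4, 3/4, 1]
    actions = [("no action", 0.0)]
    # CPU reductions use the fixed step sizes of CDF (in cores).
    actions += [("cpu", y) for y in fractions]
    # Memory and disk reductions scale with a quarter of the current capacity (MDF_n).
    actions += [("mem", x * mem / 4) for x in fractions]
    actions += [("disk", x * disk / 4) for x in fractions]
    return actions

# Example: build_action_set(cpu=4, mem=16, disk=500) yields 13 actions
# (one no-op plus four reduction steps per resource).
```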

Table 1 Actions in SaFReL

Strategy Adaptation. The agent can use different strategies for selecting actions. \(\varepsilon\)-greedy with different \(\varepsilon\) values and Softmax are well-known action selection methods in RL algorithms. They are intended to provide a proper trade-off between exploration of the state-action space and exploitation of the learned policy. In SaFReL, we use \(\varepsilon\)-greedy as the action selection strategy, and the proposed strategy adaptation feature acts as a simple meta-learning mechanism that changes the \(\varepsilon\) value dynamically to adapt the action selection strategy to new situations (new SUTs). Upon observing a SUT instance with a performance sensitivity different from the already observed ones, it adjusts the value of \(\varepsilon\) to direct the agent toward more exploration (setting \(\varepsilon\) to a higher value). Conversely, upon interaction with SUT instances similar to the previous ones, \(\varepsilon\) is adjusted to increase exploitation (setting \(\varepsilon\) to a lower value). SaFReL detects the similarity between SUT instances by calculating the cosine similarity between their performance sensitivity vectors, as shown in Eq. 7.

$$\begin{aligned} \text {similarity}(k,k-1)&=\frac{SV^k\ SV^{k-1}}{\Vert SV^k\Vert \Vert SV^{k-1}\Vert }\nonumber \\&=\frac{\sum _{i=1}^{3} {SV_i^{k}SV_i^{k-1}}}{\sqrt{\sum _{i=1}^{3}{(SV_i^{k})}^2}\sqrt{\sum _{i=1}^{3}{\left( SV_i^{k-1}\right) }^2}} \end{aligned}$$
(7)

where \(SV^k\) represents the sensitivity vector of the \(k^{th}\) SUT instance and \(SV_i^k\) represents the \(i^{th}\) element of vector \(SV^k\). The sensitivity vector contains the values of the performance sensitivity indicators of the SUT instance, \(Sen^C\), \(Sen^M\), and \(Sen^D\), which assume values in the range [0, 1] and represent the sensitivity degree of the SUT to CPU, memory, and disk, respectively. Their values can be set empirically or even intuitively, and SaFReL uses the approximate estimated similarity to tune the \(\varepsilon\) value adaptively (see Section 7.2).
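A sketch of the similarity computation of Eq. 7 together with an illustrative \(\varepsilon\) adjustment follows. The 0.8 threshold mirrors the similarity level referred to in the evaluation (Section 7.2), but the concrete \(\varepsilon\) values are assumptions rather than the exact settings of Algorithm 3.

```python
import math

def cosine_similarity(sv_k, sv_prev):
    """Eq. 7: cosine similarity between two performance sensitivity vectors
    (Sen^C, Sen^M, Sen^D) of consecutive SUT instances."""
    dot = sum(a * b for a, b in zip(sv_k, sv_prev))
    norm = math.sqrt(sum(a * a for a in sv_k)) * math.sqrt(sum(b * b for b in sv_prev))
    return dot / norm

def adapt_epsilon(similarity, high_threshold=0.8):
    """Illustrative strategy adaptation: exploit more on similar SUTs,
    explore more otherwise. Threshold and epsilon values are assumptions."""
    return 0.2 if similarity >= high_threshold else 0.5
```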

Reward Signal. The agent receives a reward signal indicating the effectiveness of the applied action in each learning step, guiding it toward reaching the intended performance breaking point. We derive a utility function as a weighted linear combination of two functions indicating the response time deviation and the resource usage, as follows:

$$\begin{aligned} R_n=\beta U_n^r+(1-\beta )U_n^E \end{aligned}$$
(8)

where \(U_n^r\) represents the deviation of response time from the response time requirement, \(U_n^E\) indicates the resource usage, and \(\beta\), \(0\le \beta \le 1\) is a parameter intended to prioritize different aspects of stress conditions, i.e., response time deviation or limited resource availability. \(U_n^r\) is defined as follows:

$$\begin{aligned} U_n^r = {\left\{ \begin{array}{ll} \text {0,} &{}\quad RT_n^\prime \le RT^q\\ \frac{(RT_n^\prime -RT^q)}{(RT^b-RT^q)}, &{}\quad RT_n^\prime > RT^q\\ \end{array}\right. } \end{aligned}$$
(9)

where \(RT_n^\prime\) is the measured response time, \(RT^q\) is the response time requirement and \(RT^b\) is the threshold defining the performance breaking point. \(U_n^E\) represents the resource utilization in the reward signal and is a weighted combination of the resource utilization values. It is defined using the following equation:

$$\begin{aligned} U_n^E= Sen^C CUI_n^{\prime }+ Sen^M MUI_n^{\prime } +Sen^D DUI_n^{\prime } \end{aligned}$$
(10)

where \(CUI_n^\prime\), \(MUI_n^\prime\), and \(DUI_n^\prime\) represent CPU, memory and disk utilization improvements, respectively, and \(Sen^C\), \(Sen^M\), and \(Sen^D\) are the performance sensitivity indicators of the SUT and assume values in the range [0, 1].
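Putting Eqs. 8-10 together, the reward of a learning step can be computed directly from the measurements; a minimal sketch with illustrative argument names:

```python
def reward(rt, rt_req, rt_break, cui, mui, dui, sen_c, sen_m, sen_d, beta=0.5):
    """Eqs. 8-10: weighted combination of response time deviation and resource usage."""
    # U_n^r: normalized deviation of the response time from the requirement (Eq. 9).
    if rt <= rt_req:
        u_r = 0.0
    else:
        u_r = (rt - rt_req) / (rt_break - rt_req)
    # U_n^E: sensitivity-weighted resource utilization improvement (Eq. 10).
    u_e = sen_c * cui + sen_m * mui + sen_d * dui
    # R_n (Eq. 8): beta prioritizes response time deviation vs. limited resources.
    return beta * u_r + (1 - beta) * u_e
```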

6 Performance testing using self-adaptive fuzzy reinforcement learning

In this section, we describe the details of the procedure SaFReL uses to generate performance test cases that reach the performance breaking points of various types of SUTs. The tester agent learns how to generate the target test cases for different types of software without access to source code or system models. The procedure of SaFReL, which includes the initial and transfer learning phases, is as follows:

The agent measures the quality parameters and identifies the state-membership degree pair, \((S_n,\mu _n )\), through fuzzy state detection, where \(S_n\) is the fuzzy state of the environment and \(\mu _n\) is the membership degree, indicating to what extent the environment has assumed that state. Then, according to the action selection strategy, the agent selects one action, \(a_n \in A_n\), based on the previously learned policy or through exploring the state-action space. The agent takes the selected action and executes the SUT. In the next step, the agent detects the new state of the SUT, \((S_{n+1},\mu _{n+1})\), and receives a reward signal, \(r_{n+1}\in \mathbb {R}\), indicating the effectiveness of the applied action. After detecting the new state and receiving the reward, it updates the stored experience (learned policy). The whole procedure is repeated until the stopping criterion, reaching the performance breaking point \((RT^b)\), is met. The experience of the agent is defined in terms of the policy that the agent learns. A policy is a mapping between states and actions and specifies the probability of taking action \(a\) in a given state \(s\). The purpose of the agent is to find a policy that maximizes the expected long-term reward achieved over the learning trials, which is formulated as follows (Sutton and Barto 2018):

$$\begin{aligned} R_n=r_{n+1}+\gamma r_{n+2}+\gamma ^2 r_{n+3}+\dots = \sum _{k=0}^{\infty } \gamma ^k r_{n+k+1} \end{aligned}$$
(11)

where \(\gamma\) is a discount factor specifying to what extent the agent prioritizes future rewards over the immediate one. We use Q-learning as the model-free RL algorithm in our framework. In Q-learning, a utility value, \(Q^\pi (s,a)\), is assigned to each pair of state and action, defined as follows (Sutton and Barto 2018):

$$\begin{aligned} Q^\pi (s,a)=E^\pi [R_n | s_n=s,a_n=a] \end{aligned}$$
(12)

The q-values, \(Q^\pi (s,a)\), form the experience base of the agent, on which it relies for action selection. The q-values are updated incrementally during the learning. Since we use fuzzy state modeling, we include the membership degree of the detected state of the environment, \(\mu _n^s\), in the typical q-value update equation to take into account the uncertainty associated with the fuzzy state, as follows:

$$\begin{aligned} Q(s_n,a_n)=\mu _n^s\left[ (1-\alpha ) Q(s_n,a_n)+ \alpha \left( r_{n+1}+ \gamma\max \limits _{a^{\prime }} Q(s_{n+1},a^{\prime })\right) \right] \end{aligned}$$
(13)

where \(\alpha\), \(0 \le \alpha \le 1\), is the learning rate, which adjusts to what extent the new utility values affect (overwrite) the previous q-values. Finally, the agent finds the optimal policy to reach the target, which suggests the action maximizing the utility value for a given state \(s\):

$$\begin{aligned} a(s)= \mathop {\hbox {argmax}}\limits _{a^{\prime }} Q(s,a^{\prime }) \end{aligned}$$
(14)

The agent selects the action based on Eq. 14 when it is supposed to exploit the learned policy. SaFReL implements two learning phases: initial and transfer learning.
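Before describing the two phases, the following sketch illustrates the membership-weighted Q-value update of Eq. 13 and the greedy choice of Eq. 14, using a plain dictionary as the Q-table; names and the default \(\alpha\) and \(\gamma\) values (0.1 and 0.5, matching the settings used later in the evaluation) are illustrative.

```python
Q = {}  # Q[(state, action)] -> learned utility value

def q_update(state, action, reward, next_state, actions, mu, alpha=0.1, gamma=0.5):
    """Eq. 13: Q-learning update scaled by the membership degree mu of the fuzzy state."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = mu * ((1 - alpha) * old + alpha * (reward + gamma * best_next))

def greedy_action(state, actions):
    """Eq. 14: exploit the learned policy by picking the highest-valued action."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```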

Initial learning. Initial learning occurs during the interaction with the first SUT instance, and the initial convergence of the policy takes place in this phase. The agent stores the learned policy (as a table of q-values, the Q-table) and repeats the learning episode multiple times on the first SUT instance to achieve the initial convergence of the policy.

Transfer learning. SaFReL goes through the transfer learning phase after the initial convergence. During this phase, the agent reuses the learned policy upon observing SUT instances with performance sensitivity similar to previously observed ones, while keeping the learning running, i.e., updating the policy upon detecting new SUT instances with different performance sensitivity. Strategy adaptation is used in the transfer learning phase and makes the agent adapt to various SUT instances. Algorithms 1 and 2 present the procedure of SaFReL in the initial and transfer learning phases, respectively.

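For illustration, the overall interaction loop on a single SUT, common to both phases, can be sketched as follows. All helpers are injected callables standing in for the components described in Sections 4 and 5, so this is an outline of Algorithms 1 and 2 rather than a reproduction of them.

```python
def run_episode(env, detect_state, select_action, take_action, compute_reward,
                q_update, breaking_point_reached):
    """One SaFReL learning episode on a single SUT: interact until the intended
    performance breaking point is reached."""
    trials = 0
    state, mu = detect_state(env)                   # fuzzy state + membership degree
    while not breaking_point_reached(env):
        action = select_action(state)               # epsilon-greedy; adaptive epsilon
                                                    # in the transfer learning phase
        take_action(env, action)                    # shrink a resource, re-run the SUT
        next_state, next_mu = detect_state(env)
        r = compute_reward(env)
        q_update(state, action, r, next_state, mu)  # membership-weighted update (Eq. 13)
        state, mu = next_state, next_mu
        trials += 1
    return trials                                   # efficiency metric: number of trials
```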

7 Evaluation

In this section, we present the experimental evaluation of the proposed self-adaptive fuzzy RL-based performance testing framework, SaFReL. We assess the performance of SaFReL in terms of efficiency in generating the performance test cases and adaptivity to various types of SUT programs, i.e., how well it can adapt its functionality to new cases while preserving its efficiency. Therefore, we examine the efficiency of SaFReL (in the transfer learning phase) compared to a typical testing process for this target, which generates the performance test cases by changing the availability of resources, based on the defined actions, in an exploratory (random) way; we call this baseline typical stress testing hereafter. We also evaluate the sensitivity of SaFReL to the learning parameters. The goal of the experimental evaluation is to answer the following research questions:

  • RQ1. How efficiently can SaFReL generate the test cases leading to the performance breaking points for different software programs compared to a typical testing procedure?

  • RQ2. How adaptively can SaFReL act on various software programs with different performance sensitivity?

  • RQ3. How is the efficiency of SaFReL affected by changing the learning parameters?

The following sub-sections describe the proposed setup for conducting the experiments, the evaluation metrics, and the analysis scenarios designed for answering the above research questions.

7.1 Experiments setup

In this study, we implement the proposed smart testing framework (agent) along with a performance simulation module that simulates the performance behavior of SUT programs under different execution conditions. The simulation module receives the resource sensitivity values and, based on the amounts of resources demanded initially and the amounts granted after taking each action, estimates the program throughput using the following equation proposed by Taheri et al. (2016):

$$\begin{aligned} Thr_j=\frac{\frac{CPU_j^g}{CPU_j^i}Sen_j^C +\frac{Mem_j^g}{Mem_j^i}Sen_j^M+ \frac{Disk_j^g}{Disk_j^i}Sen_j^D}{Sen_j^C+ Sen_j^M+ Sen_j^D}\times Thr_j^N \end{aligned}$$
(15)

where \(CPU_j^i\), \(Mem_j^i\), and \(Disk_j^i\) indicate the amounts of CPU, memory, and disk resources demanded by program j at the initial state and \(CPU_j^g\), \(Mem_j^g\), and \(Disk_j^g\) are the amounts of resources granted to program j after taking an action, which modifies the resource availability. \(Sen_j^C\), \(Sen_j^M\), and \(Sen_j^D\) represent the CPU, memory and disk sensitivity values of program j, and \(Thr_j^N\) represents the nominal throughput of program j in an isolated, contention-free environment. The response time of the program is calculated as \(RT_j=\frac{1}{Thr_j}\) in the simulation module. Figure 5 presents the implementation structure including SaFReL along with the implemented performance simulation module. In our implementation, the performance simulation module simulates the performance behavior of the SUT program and the testing agent interacts with the simulation module to capture the quality measures used for state detection.
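The simulation step of Eq. 15 amounts to a sensitivity-weighted scaling of the nominal throughput; a small Python sketch, with the granted and initially demanded resources expressed in the same units:

```python
def simulate(cpu_g, mem_g, disk_g, cpu_i, mem_i, disk_i,
             sen_c, sen_m, sen_d, nominal_throughput):
    """Eq. 15: estimate throughput from granted vs. initially demanded resources,
    weighted by the program's resource sensitivity; response time is its reciprocal."""
    weighted = (sen_c * cpu_g / cpu_i
                + sen_m * mem_g / mem_i
                + sen_d * disk_g / disk_i)
    throughput = weighted / (sen_c + sen_m + sen_d) * nominal_throughput
    response_time = 1.0 / throughput
    return throughput, response_time
```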

Fig. 5 Implementation structure

Table 2 shows the list of programs and the corresponding resource sensitivity values used in the experimentation; the data are obtained from Taheri et al. (2016). The collection listed in Table 2 includes various CPU-intensive, memory-intensive, and disk-intensive programs, as well as programs with combined types of resource sensitivity. The SUTs are instances of the programs listed in Table 2 and are characterized by various initial amounts of resources and different values of response time requirements. Two analysis scenarios are designed to answer the evaluation research questions. The first one focuses on the efficiency and adaptivity evaluation of the framework on various SUTs. In the second analysis scenario, the sensitivity of the approach to changes in the learning parameters is studied. Efficiency and adaptivity are measured (evaluated) according to the following specification:

  • Efficiency is measured in terms of the number of learning trials required by the tester agent to achieve the testing target, i.e., reaching the intended performance breaking point. The number of learning trials is an indicator of the computation time required to generate the proper test case leading to the performance breaking point.

  • Adaptivity is evaluated in terms of the number of additional learning trials (computation time) required to re-adapt the learned policy to new observations for achieving the target.

Table 2 Programs and the corresponding sensitivity values used for experimental evaluation (Taheri et al. 2016)

7.2 Experiments and results

7.2.1 Efficiency and adaptivity analysis

To answer RQ1 and RQ2, the performance of SaFReL is evaluated based on its efficiency in generating the performance test cases leading to the performance breaking points of different SUTs and its capability to adapt to new SUTs whose performance sensitivity differs from previously observed ones. We select two sets of SUT instances: i) one including SUTs similar in their performance sensitivity to resources, i.e., similar with regard to the primarily demanded resource (homogeneous SUTs); and ii) one containing SUT instances that differ in performance sensitivity (heterogeneous SUTs). The SUT instances assume different initial amounts of CPU, memory, and disk resources and different response time requirements. The amounts of CPU, memory, and disk capacity were initialized with values in the ranges [1, 10] cores, [1, 50] GB, and [100, 1000] GB, respectively. The response time requirements range from 500 to 3000 ms. The intended performance breaking point for the SUT instances is defined as the point at which the response time exceeds 1.5 times the response time requirement.

In the efficiency analysis, we set the learning parameters, the learning rate and the discount factor, to 0.1 and 0.5, respectively. We study the impact of different variants of the \(\varepsilon\)-greedy algorithm as the action selection strategy on the efficiency and adaptivity of the approach. We investigate three variants of \(\varepsilon\)-greedy, with \(\varepsilon =0.2\), \(\varepsilon =0.5\), and decaying \(\varepsilon\), as well as the proposed adaptive \(\varepsilon\) selection method.

Learning setup. First, we need to set up the initial learning. To choose a proper configuration of the action selection strategy for the initial learning, we evaluate the performance of the different variants of the \(\varepsilon\)-greedy algorithm in terms of the number of learning trials required for initial convergence (Fig. 6). For the initial convergence, we run the initial learning on the first SUT 100 times, i.e., 100 learning episodes. Table 3 presents a summarized view of the average number of learning trials during the last 10 episodes, which we consider as the values achieved upon convergence of the initial learning. As shown in Fig. 6 and Table 3, using \(\varepsilon\)-greedy with \(\varepsilon =0.2\) results in the fastest initial convergence and also the lowest number of trials compared to the other variants of \(\varepsilon\)-greedy. The number of learning trials starts converging after about 10 episodes, and during the last 10 episodes it converges to approximately 7 trials.

Fig. 6 Initial convergence of SaFReL in 100 learning episodes during the initial learning

Table 3 Initial convergence of SaFReL in the initial learning for different variants of the action selection strategy

Once the initial convergence occurs, SaFReL is ready to act on various SUTs and is expected to be able to reuse the learned policy to reach the intended performance breaking points on further SUT instances, while still keeping the learning running. The optimal policy learned in the initial learning is not influenced by the action selection strategy used, since Q-learning is an off-policy algorithm (Sutton and Barto 2018): the learner finds the optimal policy independently of how the actions are selected. For the sake of efficiency, we choose the variant that resulted in the fastest convergence.

In the following sections, we first investigate the efficiency of SaFReL compared to a typical stress testing procedure when acting on homogeneous and heterogeneous sets of SUTs, and then its capability to adapt to new SUTs with different performance sensitivity.

I. Homogeneous set of SUTs. In this step of the analysis, we select CPU-intensive programs and form a homogeneous set of SUT instances. We simulate the performance behavior of 50 instances of the CPU-intensive programs Build-apache, n-queens, John-the-ripper, Apache, Dcraw, Build-php, and X264, and vary both the initial amounts of granted resources and the response time requirements. Figure 7 shows the efficiency of SaFReL on the homogeneous set of CPU-intensive SUTs compared to a typical stress testing procedure when using \(\varepsilon\)-greedy with different values of \(\varepsilon\). Table 4 presents the average number of trials/steps for generating the target performance test case in the proposed approach and the typical testing procedure. As shown in Fig. 7, SaFReL keeps the number of required trials for \(\approx 94\%\) of the SUTs below the average number of steps required in the typical stress testing. Table 5 shows the resulting improvement in the average number of required trials/steps for meeting the target, which implies a reduction in the required computation time compared to the typical stress testing process.

In the transfer learning phase, the agent reuses the learned policy to the degree allowed by its action selection strategy. As shown in Table 4, this means that in the transfer learning the agent performs fewer trials (depending on the degree of allowed policy reuse) to meet the target on new cases, which leads to higher efficiency. According to Table 5, on a homogeneous set of SUTs, more policy reuse leads to higher efficiency (a larger computation time improvement).

Fig. 7 Efficiency of SaFReL on a homogeneous set of SUTs in the transfer learning

Table 4 Average number of trials/steps for generating the target performance test case on the homogeneous set of SUTs
Table 5 Computation time improvement on the homogeneous set of SUTs

II. Heterogeneous set of SUTs. In this part of the analysis, to complete the answer to RQ1 and also answer RQ2, we examine the efficiency and adaptivity of SaFReL during the transfer learning on a heterogeneous set of SUTs including various CPU-intensive, memory-intensive, and disk-intensive ones. We simulate the performance behavior of 50 SUT instances from the list of programs in Table 2. We evaluate the efficiency of SaFReL on the heterogeneous set of SUTs compared to the typical stress testing procedure when using \(\varepsilon\)-greedy with \(\varepsilon =0.2\), \(\varepsilon =0.5\), and decaying \(\varepsilon\) (Fig. 8). As shown in Fig. 8, the transfer learning algorithm with a typical configuration of the action selection strategy, such as \(\varepsilon =0.2\), \(\varepsilon =0.5\), or decaying \(\varepsilon\), which imposes a fixed degree of policy reuse based on the value of \(\varepsilon\), does not work well. It does not outperform the typical stress testing and even degrades slightly for some values of \(\varepsilon\). When the smart agent acts on a heterogeneous set of SUTs, blind replay of the learned policy (i.e., based only on the value of \(\varepsilon\)) is not effective; the tester agent needs to know where it should reuse the policy and where it requires more exploration to update the policy.

Fig. 8 Efficiency of SaFReL on a heterogeneous set of SUTs regarding the use of typical configurations of \(\epsilon\)-greedy

As described in Section 5, to solve this issue and improve the performance of SaFReL when it acts on a heterogeneous set of SUTs, the framework is augmented with a simple meta-learning feature enabling it to detect the heterogeneity of the SUT instances and adjust the value of the parameter \(\varepsilon\) adaptively. In general, when the smart tester agent observes a SUT instance that differs from the previously observed ones with respect to performance sensitivity, it shifts the focus of the action selection strategy toward more exploration; upon detecting a SUT instance with the same performance sensitivity as the previous ones, it makes the action selection strategy strive for more exploitation. As illustrated in Section 5, the strategy adaptation module, which fulfills this function, measures the similarity between SUTs at two levels of observation and, based on the measured values, adjusts the value of the parameter \(\varepsilon\). The threshold values of the similarity measures and the adjustments of the parameter \(\varepsilon\) used in the experimental analysis are described in Algorithm 3.


Figure 9 shows the efficiency of SaFReL on a heterogeneous set of SUTs when using similarity detection and the adaptive \(\varepsilon\)-greedy action selection strategy. With adaptive \(\varepsilon\) selection, SaFReL achieves a considerable improvement and is able to keep the number of required trials for reaching the target on approximately \(82\%\) of the SUTs below the corresponding average value of the typical stress testing. Meanwhile, the average number of learning trials is overall lower than in the typical stress testing procedure. Table 6 presents the average number of trials/steps for generating the target performance test case in SaFReL and in the typical stress testing when acting on a heterogeneous set of SUTs. Table 7 shows the corresponding improvement in computation time.

Table 6 Average number of trials/steps for generating the target performance test case on the heterogeneous set of SUTs
Table 7 Computation time improvement on the heterogeneous set of SUTs

To answer RQ2, we investigate the adaptivity of SaFReL on the heterogeneous set of SUTs for different variants of the action selection strategy, including adaptive \(\varepsilon\) selection (Fig. 10). In Fig. 10, the number of required learning trials versus the detected similarity is used to depict how adaptively SaFReL can act on a heterogeneous set of SUTs for different configurations of \(\varepsilon\). It shows that SaFReL with adaptive \(\varepsilon\) is able to adapt to changing situations, e.g., a mixed heterogeneous set of SUTs. In other words, for around \(75\%\) of the SUTs that are completely different from the previous ones (i.e., with \(similarity_{k,k-1} < 0.8\)), it still keeps the number of required trials to meet the target below the average value of the typical stress testing. This implies that it acts adaptively: it reuses the policy wherever it is useful and does more exploration wherever required.

Fig. 9 Efficiency of SaFReL on a heterogeneous set of SUTs regarding the use of adaptive \(\epsilon\)-greedy action selection strategy

Fig. 10 Adaptivity of SaFReL on a heterogeneous set of SUTs regarding the use of different variants of action selection strategy

7.2.2 Sensitivity analysis

To answer RQ3, we study the impact of the learning parameters, the learning rate (\(\alpha\)) and the discount factor (\(\gamma\)), on the efficiency of SaFReL on both the homogeneous and the heterogeneous set of SUTs. For the sensitivity analysis, we run two sets of experiments in which one learning parameter is varied while the other one is kept constant. For the experiments on the homogeneous set of SUTs, we use \(\varepsilon\)-greedy with \(\varepsilon =0.2\), the best-suited variant of the action selection strategy according to the results of the efficiency analysis (see Fig. 7), and on the heterogeneous set of SUTs, we use adaptive \(\varepsilon\) selection (see Fig. 9). In the sensitivity analysis experiments, when studying the impact of changes in the learning rate, we set the discount factor to 0.5; when examining the impact of changes in the discount factor, we keep the learning rate fixed at 0.1. Figure 11 shows the sensitivity of SaFReL to the learning rate and discount factor when it acts on the homogeneous (CPU-intensive) set of SUTs. Figure 12 depicts the results of the sensitivity analysis of SaFReL on the heterogeneous set of SUTs.

Fig. 11 Sensitivity of SaFReL to the learning rate and discount factor on the homogeneous set of SUTs

Fig. 12 Sensitivity of SaFReL to the learning rate and discount factor on the heterogeneous set of SUTs

8 Discussion

8.1 Efficiency, adaptivity, and sensitivity analysis

RQ1: Using multiple experiments, we studied the efficiency of SaFReL compared to a typical stress testing procedure on both homogeneous and heterogeneous sets of SUTs, under different action selection strategies. The results of the experiments on a set of 50 CPU-intensive SUT instances as a homogeneous set of SUTs (Fig. 7 and Tables 4 and 5) show that using \(\varepsilon\)-greedy with \(\varepsilon =0.2\) as the action selection strategy in the transfer learning leads to the desired efficiency and an improvement in computation time (around \(42\%\)) compared to the typical stress testing. This strategy makes SaFReL rely more on reusing the learned policy and results in computation time savings. The similarity between the performance sensitivities of the SUTs in a homogeneous set makes policy reuse successful in this type of testing situation.

Furthermore, we studied the efficiency of SaFReL on a heterogeneous set of 50 SUTs containing CPU-intensive, memory-intensive, and disk-intensive instances. The results show that choosing an action selection strategy without considering the heterogeneity among the SUTs (e.g., using the typical variants of \(\varepsilon\)-greedy) does not lead to desirable efficiency compared to the typical stress testing (see Fig. 8 and Tables 6 and 7). We therefore augmented our fuzzy RL-based approach with an adaptive, heterogeneity-aware action selection strategy that adjusts the value of \(\varepsilon\). It measures the similarity between the performance sensitivities of the SUTs and adjusts the \(\varepsilon\) parameter accordingly. As shown in Fig. 9, using the adaptive \(\varepsilon\)-greedy strategy addressed the issue and led to an efficient generation of the target performance test case and an improvement in computation time (around \(31\%\)). It enables the agent to reuse the learned policy according to the conditions, i.e., it uses the learned policy wherever it is useful and does more exploration wherever required.

RQ2: In the last part of the efficiency and adaptivity analysis, we extended our analysis by measuring the adaptivity of SaFReL when it performs on a heterogeneous set of SUTs. As shown in Fig. 10, with the adaptive \(\varepsilon\)-greedy strategy, SaFReL is able to adapt to changing testing situations while preserving efficiency.

RQ3: The results of the sensitivity analysis on the homogeneous set of SUTs show that setting the learning rate to lower values, such as 0.1, leads to better efficiency. Regarding the sensitivity of SaFReL to the discount factor on a homogeneous set of SUTs, the results show that lower values of the discount factor are suitable choices for the desired operation. In contrast, the sensitivity analysis on the heterogeneous set of SUTs, using adaptive \(\varepsilon\)-greedy, does not show a considerable effect of either parameter on the average efficiency of SaFReL.

8.2 Lessons learned

The experimental evaluation of SaFReL shows how machine learning can guide performance testing toward automation and take it one step further toward autonomy. Common approaches for generating performance test cases mostly rely on source code or system models, but such development artifacts might not always be available. Moreover, drawing a precise model of a complex system that predicts its state under given performance-related conditions requires considerable effort. This makes room for machine learning, particularly model-free learning techniques. Model-free RL enables the learner to explore the environment (the behavior of the SUT on the execution platform in this case) and learn the optimal policy to accomplish the objective (finding the intended performance breaking point in this case) without having a model of the system. The learner stores the learned policy and can replay it in further suitable situations. This characteristic of RL reduces the effort the learner needs to accomplish the objective in later cases and consequently improves efficiency. Therefore, the main features that lead SaFReL to outperform an exploratory (search-based) technique are the capability of storing knowledge during exploration and reusing it in suitable situations, and the possibility of selective and adaptive control over exploration and exploitation.
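
As a rough illustration of this store-and-reuse mechanism (not SaFReL's actual implementation), a learned Q-table can be persisted after the initial learning and replayed greedily on similar SUTs; the storage format below is a hypothetical choice.

import json

def save_policy(q_table, path="q_table.json"):
    # Persist the learned state-action values so a later testing session can reuse them.
    serializable = {f"{state}|{action}": value for (state, action), value in q_table.items()}
    with open(path, "w") as f:
        json.dump(serializable, f)

def replay_policy(q_table, state, actions):
    # Pure exploitation: select the action with the highest learned value for the
    # current state, i.e., reuse stored knowledge without further exploration.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))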

In general, automation, reduction in computation time and cost, and less dependency on source code and models are the main strengths of the proposed RL-assisted performance testing. Regarding applicability, given these strengths and the results of the experimental evaluation, the proposed approach could benefit performance testing of software variants in software product lines, of evolving software in a continuous integration/delivery process, and performance regression testing.

Changes in Future Trends. With the emergence of serverless architecture, which incorporates third-party backend services (BaaS) and/or runs the server-side logic in stateless containers that are fully managed by providers (FaaS), a slight shift in the objectives of performance evaluation, particularly performance testing of cloud-native applications, is expected. Within a serverless architecture, the backend code runs without the need to manage and provision resources on servers. For example, in FaaS, scaling, including resource provisioning and allocation, is done automatically by the provider whenever needed in order to preserve the response time requirement of the application. In general, the capabilities of new execution platforms and deployment architectures might slightly influence the objectives of performance testing. Nevertheless, it remains crucial for a wide range of software systems.

8.3 Threats to validity

Some of the main sources of threat to the validity of our experimental evaluation results are as follows:

Construct. One of the main sources of threat is the formulation of the RL technique to address the problem, which is essential for successful learning. The modeling of the state space, the actions, and the reward function largely guides the agent throughout the learning and determines whether it learns the optimal policy. For example, the boundaries defined in a discrete state model are a threat to construct validity. To mitigate this threat, we used a fuzzy labeling technique to deal with the uncertainty of defining sharp boundary values. Regarding the actions, their formulation affects the granularity of the exploration steps; thus, we defined the actions so as to provide a reasonable granularity for the exploration steps.
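
As an illustration of the fuzzy labeling idea (with hypothetical breakpoints, not the membership functions used in SaFReL), a normalized resource-utilization observation can be mapped to graded state labels instead of sharp boundaries:

def triangular(x, a, b, c):
    # Triangular membership function with support [a, c] and peak at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_state_labels(utilization):
    # Graded membership of a normalized utilization value (0..1) in three
    # overlapping labels; the breakpoints below are purely illustrative.
    return {
        "low": triangular(utilization, -0.01, 0.0, 0.5),
        "moderate": triangular(utilization, 0.2, 0.5, 0.8),
        "high": triangular(utilization, 0.5, 1.0, 1.01),
    }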

Internal. There are a number of threats to the internal validity of the results. RL techniques, like many other machine learning algorithms, are influenced by their hyperparameters, such as the learning rate and the discount factor. We did not change the learning parameters during the efficiency and adaptivity analysis experiments; instead, we conducted a separate set of controlled experiments to study the influence of the learning parameters on the efficiency of our approach.

An insufficient number of learning episodes/iterations could also be a source of threat in the initial learning. To alleviate this threat, we iterated the initial learning long enough to ensure convergence. Moreover, using a performance simulation module instead of actually executing SUTs is considered a source of threat to the validity of the results.

Finally, model-free RL is mainly intended to solve a decision-making problem (i.e., to find an optimal policy for behaving) without access to a model of the environment. Therefore, not considering the structure of the environment might be a source of threat in the case of an improper formulation of the RL technique.

External. Model-free RL learns the optimal policy for achieving the target through interaction with the environment. Our approach was formulated for SUTs with three types of performance sensitivity, namely CPU-intensive, memory-intensive, and disk-intensive, and our results are derived from the experimental evaluation of the approach on these types of SUTs. If the experiments included SUTs with other types of performance sensitivity, such as network-intensive programs, the approach would need to be slightly reformulated to support the new types of performance sensitivity.

Moreover, the dependency of the performance simulation module on the performance sensitivity values could raise a threat to validity when the smart tester agent is deployed together with the performance simulation module. The performance simulation module requires the performance sensitivity values of the SUTs, as described in our experiments. In a real deployment of the approach, e.g., in a cloud-based testing setup without the performance simulation module, the dependency on the performance sensitivity values is lighter and their exact values are not necessary. Nonetheless, this is still considered a source of threat.

9 Related work

Measurement of performance metrics under typical or stress test execution conditions, which involve both workload and platform configuration aspects (Menascé 2002; Hill et al. 2009; Apte et al. 2017; Michael et al. 2017; Jindal et al. 2019), and detection of performance-related issues, such as functional problems or violations of performance requirements emerging under certain workload or resource configuration conditions (Briand et al. 2005; Zhang et al. 2011; Ayala-Rivera et al. 2018; Schulz et al. 2019), are common objectives of different types of performance testing.

Different approaches have been proposed to design the target performance test cases for accomplishing performance-related objectives such as finding intended performance breaking points. Performance test conditions involve both the workload and the resource configuration status. A general high-level categorization of the main techniques for generating performance test cases is as follows:

Source code analysis. Data-flow analysis and symbolic execution, used to derive workload-based performance test conditions, are examples of techniques for designing fault-inducing performance test cases based on source code analysis in order to detect performance-related issues such as functional problems (like memory leaks) and violations of performance requirements (Yang and Pollock 1996; Zhang et al. 2011).

System model analysis. Examples of techniques based on system model analysis include modeling the system behavior in terms of performance models such as Petri nets and using constraint solving (Zhang and Cheung 2002), using the control flow graph of the system and applying search-based techniques (Gu and Ge 2009; Di Penta et al. 2007), and using other types of system models, such as UML models, together with genetic algorithms (Garousi 2010; Garousi 2008; Garousi et al. 2008; Costa et al. 2012; da Silveira et al. 2011) to generate the performance test cases.

Behavior-driven declarative techniques. Using a Domain Specific Language (DSL) to provide declarative goal-oriented specifications of performance tests and model-driven execution frameworks for automated execution of the tests (Ferme and Pautasso 2018; Ferme and Pautasso 2017; Walter et al. 2016), and using a high-level behavior-driven language inspired from Behavior-Driven Development (BDD) techniques to define test conditions (Schulz et al. 2019) in combination with a declarative performance testing framework like BenchFlow (Ferme and Pautasso 2017) are examples of behavior-driven techniques for performance testing.

Modeling realistic conditions. Modeling the real user behavior through stochastic form-oriented models (Draheim et al. 2006; Lutteroth and Weber 2008), extracting workload characteristics from the recorded requests and modeling the user behavior using, e.g., extended finite state machines (EFSMs) (Shams et al. 2006) or Markov chains (Vogele et al. 2018), sandboxing services and deriving a regression model of the deployment environment based on the data resulting from sandboxing to estimate the service capacity (Jindal et al. 2019), end-user clustering based on the business-level attributes extracted from usage data (Maddodi et al. 2018), and using automated GUI testing tools with capture and replay techniques to generate realistic interactive usage sequences (Adamoli et al. 2011) are examples of techniques based on modeling the realistic conditions to generate the performance test cases.

Machine learning-enabled techniques. Machine learning techniques such as supervised and unsupervised algorithms mainly work by building models and extracting patterns (knowledge) from data, while other techniques such as RL algorithms are intended to train a learner agent to solve problems (tasks); the agent learns an optimal way to achieve an objective through interacting with the system. Machine learning has been widely used for the analysis of data resulting from performance testing and for performance preservation. Examples include anomaly detection through the analysis of performance data, e.g., resource usage, using clustering techniques (Syer et al. 2011), predicting reliability from testing data using Bayesian Networks (Avritzer et al. 2008), performance signature identification based on performance data analysis using supervised and unsupervised learning techniques (Malik et al. 2013; Malik et al. 2010), and adaptive RL-driven performance control, in particular response time control, for cloud services (Ibidunmoye et al. 2017; Veni and Bhanu 2016; Jamshidi et al. 2016) and for software on other execution platforms, e.g., PLC-based real-time systems (Moghadam et al. 2018). Machine learning has also been applied to the generation of performance test cases in some studies, for example, using symbolic execution in combination with an RL algorithm to find the worst-case execution path within a SUT (Koo et al. 2019), using RL to find a sequence of input workloads leading to performance degradation (Ahmad et al. 2019), and feedback-driven learning to identify performance bottlenecks by extracting rules from execution traces (Grechanik et al. 2012). There are also some adaptive techniques, slightly analogous to the concept of RL, for generating performance test cases, for example, an adaptive workload generation that adapts the workload dynamically based on pre-defined adjustment policies (Ayala-Rivera et al. 2018), and a feedback-driven approach that uses search algorithms to benchmark an NFS server under varying workload parameters in order to find the workload peak rate that reaches the target response time confidence level.

10 Conclusion

Performance testing is a family of techniques commonly used as part of performance analysis, e.g., for estimating performance metrics or detecting performance violations. One important goal of performance testing, particularly in mission-critical domains, is to verify the robustness of the SUT in terms of finding the performance breaking point. Model-driven techniques might be used for this purpose in some cases, but drawing a precise model of the performance behavior of a complex software system under different application-, platform-, and workload-based factors is difficult. Furthermore, such modeling might disregard important implementation and deployment details. In software testing, source code analysis, system model analysis, use case-based design, and behavior-driven techniques are common approaches for generating performance test cases. However, source code and other artifacts might not be available during testing.

In this paper, we proposed a fuzzy reinforcement learning-based performance testing framework (SaFReL) that adaptively and efficiently generates the target performance test cases resulting in the intended performance breaking points for different software programs, without access to source code and system models. We used Q-learning augmented by fuzzy state modeling and an action selection strategy adaptation that resulted in a self-adaptive autonomous tester agent. The agent can learn the optimal policy to achieve the target (reaching the intended performance breaking point), reuse its learned policy when deployed to test similar software, and adapt its strategy when targeting software with different characteristics.

We evaluated the efficiency and adaptivity of SaFReL through a set of experiments based on simulating the performance behavior of various SUT programs. During the experimental evaluation, we tried to answer how efficiently and adaptively SaFReL can perform testing of different SUT programs compared to a typical stress testing approach. We also performed a sensitivity analysis to explore how the efficiency of SaFReL is affected by changing the learning parameters.

We believe that the main strengths of the intelligent automation offered by SaFReL are 1) efficient generation of test cases and a reduction in computation time, and 2) less dependency on source code and models. Regarding applicability, we believe that SaFReL could benefit the testing of software variants, of evolving software during the continuous integration/delivery (CI/CD) process, and performance regression testing. Applying heuristics and techniques to speed up the exploration of the state space, such as using multiple cooperating agents, and extending the proposed approach to support workload-based performance test cases are further steps to continue this research.