Introduction

With the amount of data rapidly increasing, many applications need higher-performance hardware to run. However, for most individuals and organizations, such hardware is too expensive for their limited budgets. Consequently, the pay-as-you-go pricing scheme of cloud computing makes migrating applications to the cloud an attractive choice, since it reduces the cost of purchasing and maintaining hardware. There are three major service models for cloud computing: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) [1]. Amazon Elastic Compute Cloud (EC2) [2] is a representative IaaS offering, which provides users with basic hardware resources such as CPU, network, memory and storage.

Amazon EC2 provides users with many purchase options, which can be divided into three categories: reserved instances, on-demand instances and spot instances [2]. Reserved instances allow users to purchase a long-term right to use instances at a lower price, which is cost-effective for long-running applications. On-demand instances cost more than reserved instances, but users pay only for the actual running time of their applications, so they are suitable for short-term applications. Because user demand fluctuates, Amazon provisions a large amount of redundant resources to handle peak demand, and many of these resources are idle most of the time. Therefore, in December 2009, Amazon introduced a new type of instance called the spot instance [3] to sell these idle resources and improve resource utilization.

Fig. 1 Spot price in us-east-2a c4.xlarge instance (Oct 27 to Nov 26, 2017)

Spot instances allow users to propose a bid, which is the maximum price the user is willing to pay. The spot instance price is mainly driven by fluctuations in market supply and demand. When the spot price is not higher than the user’s bid, the user obtains the right to use the instance and pays the actual spot price rather than the bid. When the spot price rises above the user’s bid, the user’s instance is forcibly interrupted. In this case, Amazon gives the user a two-minute warning [3], during which the user can save or migrate data. Figure 1 shows the spot price history of the c4.xlarge instance in us-east-2a from October 27 to November 26, 2017; the price fluctuates throughout this period. If the user’s bid equals the on-demand price, the bid usually succeeds, because the spot price is lower than the bid most of the time. But at time points a and b in Fig. 1, the spot price is higher than the bid, so an out-of-bid event occurs and the user’s instance is interrupted.
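The bidding rules described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the paper: prices are sampled once per hour, billing is per hour at the spot price, and the function name `simulate_spot_usage` and the price values are hypothetical.

```python
def simulate_spot_usage(hourly_prices, bid):
    """Return (hours_run, total_cost, interrupted) for a fixed bid.

    Simplified model: the instance runs while the spot price is not higher
    than the bid, the user pays the spot price (not the bid), and the first
    price above the bid interrupts the instance.
    """
    hours_run = 0
    total_cost = 0.0
    for price in hourly_prices:
        if price > bid:
            # Out-of-bid event: the instance is interrupted by force.
            return hours_run, total_cost, True
        hours_run += 1
        total_cost += price  # billed at the spot price, not at the bid
    return hours_run, total_cost, False


# Hypothetical hourly spot prices in USD (not taken from Fig. 1).
prices = [0.04, 0.05, 0.04, 0.25, 0.04]
print(simulate_spot_usage(prices, bid=0.20))  # interrupted at the price spike
print(simulate_spot_usage(prices, bid=0.30))  # runs for all five hours
```

With the lower bid the run stops at the spike, while the higher bid survives it and still pays only the spot price for each hour.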

Amazon states that appropriate use of spot instances can save up to 90% of the cost of on-demand instances [3]. Moreover, Fig. 1 shows that using spot instances indeed costs less than using on-demand instances. However, because the spot price fluctuates widely, out-of-bid events may occur and bids may fail. Therefore, if the spot price can be predicted accurately in advance, users can set bids and select instances more effectively, which saves a considerable amount of money and noticeably improves reliability. In addition, by predicting the price in advance, users can learn the future price trend of a spot instance, arrange their purchase time reasonably, and choose a suitable bid that avoids both the high cost caused by a high bid and the instance unavailability caused by a low bid. Therefore, the future price needs to be predicted from the historical price data. Users can obtain the historical price of a spot instance from the Amazon EC2 dashboard [4], or through tools provided by Amazon, such as Boto, an AWS SDK for Python [5], and the AWS Command Line Interface (AWS CLI) [6], an open source tool built on top of Boto that provides a consistent interface for interacting with all parts of Amazon Web Services (AWS).

Spot instances were first introduced by Amazon, and other cloud service providers such as Alibaba and Tencent later offered them as well. Amazon’s spot instances remain the most popular, widely used and representative in the cloud market. Thus, in this paper, we propose a price prediction method for spot instances and use Amazon as the representative provider to elaborate the method. The whole price prediction process, especially the data preprocessing techniques such as resampling and the sliding window, and the evaluation metrics for spot price prediction, can be readily applied to the spot instances of other providers.

The contributions of this paper are mainly as follows:

  1. A price prediction model based on k-Nearest Neighbors (kNN) regression is proposed to predict the future price of cloud spot instances.

  2. The representative Amazon AWS is taken as a testbed, and the historical price data of spot instance is obtained through AWS CLI [6]. An innovative sliding window method is adopted to preprocess the data.

  3. Taking the different real cases into account, we implement and discuss the spot instance price prediction in two scenarios: 1-day-ahead and 1-week-ahead. The accuracy of our model is verified by comparing with several other models.

The rest of this paper is organized as follows. In the “Related work” section, we introduce the related work on price prediction. A mathematical description of the spot price prediction problem is given in the “Problem definition” section. Then, in the “Proposed method” section, we elaborate on the price prediction model based on kNN regression. In the “Experiments and discussions” section, we describe the experiment setup and report experiments that verify the effectiveness of our model. Finally, conclusions and future work are presented in the “Conclusions” section.

Related work

Due to the high complexity of cloud data centers, Fernández-Cerero et al. [7, 8] demonstrate that the performance of a data center often cannot be predicted. In contrast, prices are often predictable, especially for spot instances. Since Amazon introduced this kind of instance, a growing number of researchers have tried to analyze and predict its price. The purpose is to help users understand the price characteristics of spot instances, and then design a reasonable bidding scheme and a suitable combination of instances. Low monetary cost is one of the most important metrics and driving forces for users to adopt cloud services and host their data in the cloud, and it has been widely studied in cloud instance selection [9,10,11], cloud storage [12,13,14], and scientific workflow scheduling [15, 16]. Therefore, it is important to understand and predict the prices of cloud instances, especially spot instances.

Agmon et al. [17] analyze the actual price of spot instances and build a price model consistent with the existing price traces by reverse engineering the pricing mechanism. The results show that, most of the time, the spot price is likely generated at random within a tight price range via a dynamic hidden reserve price mechanism. During the period of the authors’ study, Amazon may have set the spot price via a random AR(1) hidden reserve price mechanism.

Kumar et al. [18] present a survey of spot instance pricing in the cloud ecosystem. It gives an insight into Amazon spot instances and their pricing mechanism for a better understanding of the spot ecosystem. A large number of important research papers on price prediction and modeling, spot resource provisioning, bidding strategy design and related topics are summarized and categorized in this survey. There have been many studies on the prediction of spot instance prices. We divide them into two major categories, based on the methods used.

Price prediction models based on statistical time series analysis

Javadi et al. [19] study the features of Amazon’s spot instances and analyze the historical price by hour-in-day and day-in-week. They also propose a statistical model based on Gaussian distributions to fit these two distributions. The model contains three to four components to better capture the dynamic changes in price and the duration of price changes for each instance. Simulation experiments show that it achieves better accuracy in real work environments.

Cai et al. [20] argue that spot instance prices usually exhibit switching regimes, so traditional autoregressive models are not suitable for forecasting them. They therefore propose two Markov regime-switching autoregressive models: DMRA-AR-L and DMRA-AR-SW. Experiments on 144 days of spot instance price history show that, in most cases, DMRA-AR-L performs best when the forecast period is shorter than 24 h, while DMRA-AR-SW performs best for longer forecast periods.

In [21], a SARIMA model is established by analyzing historical prices of spot instances. Compared with other price prediction models such as mean and naive forecasts on 11 months of data, SARIMA achieves better accuracy.

Price prediction models based on machine learning

Mishra et al. [22] use linear regression to deal with the spot price prediction problem, with a price history of 90 days. Wallace et al. [23] and Agarwal et al. [24] use artificial neural networks to predict the price of spot instances. Wallace et al. [23] use 7 months of historical data to predict the spot price with a standard multi-layer perceptron, but only predict the price for one ahead. Agarwal et al. [24] use 90 days of historical prices and establish an LSTM model to predict the spot instance price. The experimental results show that [24] performs better than [22] and [23].

Khandelwal et al. [25] use 12 months of Amazon EC2 spot price history to predict 1-day-ahead and 1-week-ahead spot instance prices with random forests regression. The method is compared experimentally with neural networks, support vector machine regression, regression trees and other methods. The results indicate that random forests regression performs better than the other methods.

Neto et al. [26] propose a heuristic model that uses checkpoint and restore techniques, and takes price change traces of spot instances as input to a machine learning and statistical model to predict the time to revocation. By using a bid strategy and the observed price variation history, their model can predict the revocation time with high accuracy.

Singh et al. [27] use the data of each month as a global trend and the previous day’s data as a local periodic change to establish a price prediction model. They use gradient descent to adjust the parameters of the model and run experiments on 9 months of spot instance price data. They forecast spot prices separately for the short term (1 h) and the long term (1 day and 1 week). The results show that short-term prediction is clearly more accurate than long-term prediction.

In this paper, we use a sliding window method to preprocess the historical price data, which is obtained with AWS CLI [6]. Then, we build a price prediction model based on k-Nearest Neighbors (kNN) regression to predict the future price of spot instances. Experiments are performed to discuss spot instance price prediction in two scenarios: 1-day-ahead and 1-week-ahead. The accuracy of our model is verified by comparing it with several other models.

Problem definition

Fig. 2 An example of using sliding window to divide the price data. a Description of price data when the time interval is 1 hour and the sampling time is one day (\(\varvec{p}=[p_1,p_2,\ldots ,p_{24}]\), \(l_p=24\)). b Description of the sample set generation (\(l_{sw}=12, l_{pt}=4\)), and in this case, the data is just being used completely. c Description of the sample set generation (\(l_{sw}=10, l_{pt}=5\)), and in this case, there is excess of data

The historical price of spot instance s is represented as a vector \(\varvec{p}=[p_1,p_2,\ldots ,p_{l_p}]\), \(\varvec{p}\in {\mathbb {\varvec{R}}}^{l_p}\), where \(l_p\) is the length of \(\varvec{p}\), in other words, the length of the price history. For example, Fig. 2a shows the historical price for a sampling period of 1 day (24 h) with a time interval of 1 h; in this case, the historical price is represented as a vector (\(\varvec{p}=[p_1,p_2,\ldots ,p_{24}]\)) of length 24 (\(l_p=24\)).

In this paper, we use a sliding window to divide the price data. We use \(l_{sw}\) and \(l_{pt}\) to denote the length of the sliding window and the length of the time window to be predicted, respectively. To ensure that the label windows of consecutive samples do not overlap, the stride of the sliding window is set to \(l_{pt}\). So the number of samples we get by sliding the window is:

$$\begin{aligned} n_s=\lfloor \frac{l_p-l_{sw}}{l_{pt}}\rfloor \end{aligned}$$
(1)

In Eq. 1, the number of samples is rounded down because the data may not divide evenly into windows. In that case, the window is slid in reverse, starting from the most recent data, so that the excess data is the oldest part of the history. For example, in Fig. 2b, the number of samples is 3 (\(n_s=3\)) when the length of the sliding window is 12 (\(l_{sw}=12\)) and the length of the time window to be predicted is 4 (\(l_{pt}=4\)). However, in Fig. 2c the number of samples is 2 (\(n_s=2\)) when the length of the sliding window is 10 (\(l_{sw}=10\)) and the length of the time window to be predicted is 5 (\(l_{pt}=5\)), and in this case, there is excess data. We use \({\varvec{D}}=\{D_1,D_2,\ldots ,D_{n_s}\}\) to denote the sample set after sliding, where \(D_i=(\varvec{x_i},\varvec{y_i})\) is the sample formed after sliding \(n_s-i+1\) times, \(\varvec{x_i}\in {{\mathbb {\varvec{R}}}^{l_{sw}}}\) is the feature vector of sample \(D_i\), i.e. the data in the sliding window, and \(\varvec{y_i}\in {{\mathbb {\varvec{R}}}^{l_{pt}}}\) is the label vector of sample \(D_i\). For example, in Fig. 2b the sample set is \({\varvec{D}}=\{(\varvec{x_1},\varvec{y_1}),(\varvec{x_2},\varvec{y_2}),(\varvec{x_3},\varvec{y_3})\}\), and in Fig. 2c, \({\varvec{D}}=\{(\varvec{x_1},\varvec{y_1}),(\varvec{x_2},\varvec{y_2})\}\). The vectors \(\varvec{x_i}\) and \(\varvec{y_i}\) are given by:

$$\begin{aligned} \varvec{x_i}=\, & {} [p_{l_p-(n_s-i+1)l_{pt}-l_{sw}+1},\ldots ,p_{l_p-(n_s-i+1)l_{pt}}] \end{aligned}$$
(2)
$$\begin{aligned} \varvec{y_i}=\,&[p_{l_p-(n_s-i+1)l_{pt}+1},\ldots ,p_{l_p-(n_s-i)l_{pt}}] \end{aligned}$$
(3)
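The sample generation of Eqs. 1 to 3 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation; the function name `make_samples` is hypothetical, and the price vector is a plain list with 0-based indexing, so the 1-based indices of the equations are shifted by one.

```python
def make_samples(p, l_sw, l_pt):
    """Split the price history p into (x_i, y_i) pairs following Eqs. 1-3."""
    l_p = len(p)
    n_s = (l_p - l_sw) // l_pt  # Eq. 1: number of samples, rounded down
    samples = []
    for i in range(1, n_s + 1):
        # 1-based index of the last element of x_i (upper index in Eq. 2).
        end_x = l_p - (n_s - i + 1) * l_pt
        x_i = p[end_x - l_sw:end_x]   # Eq. 2: the l_sw prices in the window
        y_i = p[end_x:end_x + l_pt]   # Eq. 3: the next l_pt prices
        samples.append((x_i, y_i))
    return samples


# Stand-in for p_1..p_24 of Fig. 2a: the values are just the indices.
p = list(range(1, 25))
for x_i, y_i in make_samples(p, l_sw=10, l_pt=5):  # setting of Fig. 2c
    print(x_i[0], x_i[-1], y_i)
```

In the Fig. 2c setting the first window starts at \(p_5\), so the four oldest prices are the excess data that no sample uses.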

The goal of this paper is to predict the spot instance price, namely, to find a function f that satisfies the following formula:

$$\begin{aligned} \varvec{y_i}=f(\varvec{x_i}), 1\le i\le n_s \end{aligned}$$
(4)

Proposed method

To predict the spot instance price, we construct a kNN regression model that predicts \({\varvec{y}}\) from a new input \(\varvec{X}\).

For the new input \(\varvec{X}\), kNN regression finds the k nearest samples in the training set, and the average of their label values is used as the predicted value for \(\varvec{X}\), denoted \(\hat{{\varvec{y}}}\). In our model, two issues have to be addressed.

The first one is the distance function. To measure the distance between the new input \(\varvec{X}\) and a training sample \(\varvec{x_i}(1\le i\le n_s)\), we choose the Euclidean distance, where \(x_i^j\) and \(X^j\) denote the j-th components of \(\varvec{x_i}\) and \(\varvec{X}\), respectively:

$$\begin{aligned} dist(\varvec{x_i},\varvec{X})=\sqrt{\sum _{j=1}^{l_{sw}}(x_i^j-X^j)^2}, 1\le i\le n_s \end{aligned}$$
(5)
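As a reference point, kNN regression with the Euclidean distance of Eq. 5 can be written as a plain linear scan. The sketch below is illustrative only; the function names and the toy values are hypothetical, and it deliberately omits any search acceleration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (Eq. 5)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def knn_predict(samples, x_new, k=1):
    """Average the label vectors of the k training samples nearest to x_new."""
    nearest = sorted(samples, key=lambda s: euclidean(s[0], x_new))[:k]
    l_pt = len(nearest[0][1])
    return [sum(y[j] for _, y in nearest) / len(nearest) for j in range(l_pt)]


# Toy training set (hypothetical values): each sample is (x_i, y_i).
train = [
    ([1.0, 1.0], [1.0]),
    ([2.0, 2.0], [3.0]),
    ([9.0, 9.0], [10.0]),
]
print(knn_predict(train, [1.2, 1.1], k=1))  # uses the label of the first sample
print(knn_predict(train, [1.6, 1.6], k=2))  # averages the first two labels
```

Each prediction scans all \(n_s\) training samples, which is the cost that a faster neighbor search tries to avoid.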

The other one is how to improve computational efficiency when the model searches for the k nearest neighbors in the training data. Since the simplest linear scan is very time-consuming, we use a k-dimensional tree (k-d tree) as a fast kNN search algorithm. The k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space; here the dimension is the sliding window length \(l_{sw}\), not the number of neighbors k. For a single search, the average time complexity of the k-d tree is \(O(\log n)\), and it is O(n) in the worst case. In contrast, the average time complexity of a linear scan is O(n), so the k-d tree is chosen.

In kNN, the model training process is the k-d tree building process, which is carried out by algorithm BuildKDTree. The dimension with the maximum variance is selected (line 4), and the median sample \(\bar{D}\) of \({\varvec{D}}\) along this dimension is taken as the current node (line 5 and line 7). Then, the remaining samples \(\varvec{D'}\) continue to be divided (line 9–12) until the set becomes empty (line 1–3). At last, we get the k-d tree kdTree (line 13). Regarding time complexity, the k-d tree is built recursively in algorithm BuildKDTree, and the recursion depth is \(O(\log {n_s})\). Since the dimension with the maximum variance is calculated at each level of the recursion, the cost per level is \(O(l_{sw}n_s)\). Thus the overall time complexity of algorithm BuildKDTree is \(O(l_{sw}n_s\log {n_s})\).

In algorithm Search, the set of k nearest nodes of the input \(\varvec{X}\) is obtained through the kNN search process. Firstly, for input \(\varvec{X}\), we need to collect k nodes from \({\varvec{D}}\) (line 2–4). In the nearest nodes set, if the farthest node \({\varvec{maxDN}}\) is farther from \(\varvec{X}\) than the current \({\varvec{node}}\), \({\varvec{maxDN}}\) is replaced by \({\varvec{node}}\) (line 5 to line 9). Meanwhile, if the value of \(\varvec{X}\) in the axis dimension is less than that of \({\varvec{node}}\), we search for nearer nodes in \({\varvec{node}}\)’s left subtree (line 13–16). In this case, if the distance between x[axis] and node[axis] is less than the maximum distance between the k nearest nodes set and \(\varvec{X}\), we also search \({\varvec{node}}\)’s right subtree (line 17–19). The case in which x[axis] is greater than node[axis] is handled symmetrically (line 21–27). Finally, the algorithm returns the set of k nearest nodes. In terms of time complexity, because algorithm Search needs to find the point with the maximum distance to \(\varvec{X}\) in the nearest nodes set, each node visit costs O(\(kl_{sw}\)). In the worst case, the search has to traverse all nodes of the k-d tree, i.e. \(O(n_s)\) visits. Thus the overall worst-case time complexity of algorithm Search is \(O(kl_{sw}n_s)\).
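The two procedures described above can be sketched as follows. This is not the authors' listing: the dictionary-based node layout, the function names and the toy data are hypothetical, and a max-heap stands in for the explicit search for the farthest node \({\varvec{maxDN}}\), which gives the same neighbors.

```python
import heapq
import math
import statistics


def build_kd_tree(samples):
    """Recursively build a k-d tree from (x, y) samples as nested dicts."""
    if not samples:
        return None
    n_dims = len(samples[0][0])
    # Split on the dimension with the largest variance.
    axis = max(range(n_dims),
               key=lambda d: statistics.pvariance([s[0][d] for s in samples]))
    ordered = sorted(samples, key=lambda s: s[0][axis])
    mid = len(ordered) // 2  # the median sample becomes the current node
    return {"sample": ordered[mid], "axis": axis,
            "left": build_kd_tree(ordered[:mid]),
            "right": build_kd_tree(ordered[mid + 1:])}


def search(node, x_new, k, heap=None):
    """Collect the k nearest samples as a max-heap of (-dist, id, sample)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    x_node, axis = node["sample"][0], node["axis"]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_node, x_new)))
    if len(heap) < k:
        heapq.heappush(heap, (-dist, id(node), node["sample"]))
    elif dist < -heap[0][0]:
        # The current node is closer than the farthest kept neighbor.
        heapq.heapreplace(heap, (-dist, id(node), node["sample"]))
    diff = x_new[axis] - x_node[axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    search(near, x_new, k, heap)
    # Visit the other side only if the splitting plane is closer than the
    # farthest kept neighbor, or if fewer than k neighbors were found so far.
    if len(heap) < k or abs(diff) < -heap[0][0]:
        search(far, x_new, k, heap)
    return heap


def knn_predict(tree, x_new, k=1):
    """Average the label vectors of the k nearest samples found in the tree."""
    neighbors = [sample for _, _, sample in search(tree, x_new, k)]
    l_pt = len(neighbors[0][1])
    return [sum(y[j] for _, y in neighbors) / len(neighbors)
            for j in range(l_pt)]


# Toy samples (hypothetical): x_i has l_sw = 2 values, y_i has l_pt = 1 value.
train = [([1.0, 1.0], [1.0]), ([2.0, 2.0], [3.0]), ([9.0, 9.0], [10.0]),
         ([8.0, 1.0], [5.0]), ([4.0, 7.0], [6.0])]
tree = build_kd_tree(train)
print(knn_predict(tree, [1.6, 1.6], k=2))
```

The pruning test on `abs(diff)` is what lets the search skip whole subtrees; when it rarely succeeds, the search degrades to visiting every node, which matches the worst case discussed above.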

figure a

Algorithm kNN realizes kNN regression. Firstly, we construct a k-d tree from \({\varvec{D}}\) by using algorithm BuildKDTree (line 1), and then the k nearest samples of \(\varvec{X}\) are found via algorithm Search (line 2). Finally, by taking the average of the k samples (line 3), we get the predicted price \(\varvec{\hat{y}}\) of \(\varvec{X}\) (line 4). Because the time complexity of line 1 is \(O(l_{sw}n_s\log {n_s})\), line 2 is \(O(kl_{sw}n_s)\) and line 3 is O(k), the overall time complexity of algorithm kNN is \(O(l_{sw}n_s(k+\log {n_s}))\), which reduces to \(O(kl_{sw}n_s)\) when k is at least of the order of \(\log {n_s}\).

Experiments and discussions

To better evaluate the performance of our model, extensive experiments are conducted in this section. We first introduce the experiment setup, and then describe the experimental results.

Experiment setting

In this section, we first introduce the environment of our experiments, then describe the data acquisition method and the data preprocessing process. Finally, the compared algorithms and the performance metrics are presented.

figure b

Experimental environment Experiments are performed on a GNU/Linux operating system with an Intel(R) Core(TM) i5-7500 at 3.40 GHz and 16 GB of RAM. We use the Python 3.5 programming language to implement the algorithms.

figure c
Table 1 Abbreviations used for different regions
Table 2 Abbreviations used for different instances

Experimental data Amazon provides tools that allow users to retrieve the spot price history. We use AWS CLI [6] to access Amazon EC2 and obtain 88 days of historical prices, from September 1, 2017 to November 27, 2017. The regions and instance types involved are described in Tables 1 and 2.
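The paper retrieves the data with AWS CLI; a comparable retrieval with Boto3, the AWS SDK for Python, could look like the sketch below. It is an assumption-laden illustration, not the authors' script: the function name `fetch_spot_history` is hypothetical, the `Linux/UNIX` product description is an assumed platform, and the client is passed in so that the function itself does not need credentials. To our knowledge the service only returns recent history (about 90 days), so the 2017 range used in this paper can no longer be queried.

```python
from datetime import datetime, timezone


def fetch_spot_history(ec2_client, instance_type, zone, start, end):
    """Return (timestamp, price) pairs sorted by time for one type and zone.

    ec2_client is expected to be a Boto3 EC2 client; the EC2 operation used
    is DescribeSpotPriceHistory, accessed through its paginator.
    """
    paginator = ec2_client.get_paginator("describe_spot_price_history")
    pages = paginator.paginate(
        InstanceTypes=[instance_type],
        AvailabilityZone=zone,
        ProductDescriptions=["Linux/UNIX"],  # assumed platform
        StartTime=start,
        EndTime=end,
    )
    records = []
    for page in pages:
        for item in page["SpotPriceHistory"]:
            # SpotPrice is returned as a string and converted to float here.
            records.append((item["Timestamp"], float(item["SpotPrice"])))
    return sorted(records)


# Example usage (needs the third-party boto3 package and AWS credentials):
#   import boto3
#   client = boto3.client("ec2", region_name="us-east-2")
#   start = datetime(2017, 9, 1, tzinfo=timezone.utc)
#   end = datetime(2017, 11, 28, tzinfo=timezone.utc)
#   history = fetch_spot_history(client, "c4.xlarge", "us-east-2a", start, end)
```

Each returned record marks a price change at an irregular time, which is why the series has to be re-sampled before it can be used for training.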

Because the interval between price changes is irregular, we re-sample the historical price in 1-h units. Users only need to consider the maximum price within each hour to make a bid, so we use the maximum value within each sampling unit as the re-sampled value. After re-sampling, we get 2112 values for each instance type in each region (\(24\times 88\) (days)) and 76,032 values in total (\(2112\times 4\)(regions)\(\times 9\)(instances)). In this paper, we divide the dataset into 80% and 20% for the training set and test set, respectively.
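The hourly re-sampling and the 80/20 split can be sketched as follows. Several details are assumptions that the text does not specify: a price is taken to stay in effect until the next change (so hours without a change carry the last known price), the split is chronological, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def resample_hourly_max(records, start, hours):
    """Turn irregular (timestamp, price) changes into one max price per hour."""
    records = sorted(records)
    series = []
    idx = 0
    current = None  # price in effect at the start of the hour, if known
    for h in range(hours):
        hour_start = start + timedelta(hours=h)
        hour_end = hour_start + timedelta(hours=1)
        # Apply all changes up to and including the start of this hour.
        while idx < len(records) and records[idx][0] <= hour_start:
            current = records[idx][1]
            idx += 1
        in_effect = [] if current is None else [current]
        # Collect the changes that happen strictly inside the hour.
        while idx < len(records) and records[idx][0] < hour_end:
            current = records[idx][1]
            in_effect.append(current)
            idx += 1
        series.append(max(in_effect) if in_effect else None)
    return series


def chronological_split(items, train_fraction=0.8):
    """Split a sequence into the first 80% (training) and last 20% (test)."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Hypothetical price changes around midnight of September 1, 2017.
t0 = datetime(2017, 9, 1)
changes = [(t0 - timedelta(minutes=30), 0.04),
           (t0 + timedelta(minutes=20), 0.07),
           (t0 + timedelta(minutes=50), 0.05),
           (t0 + timedelta(hours=2, minutes=10), 0.03)]
print(resample_hourly_max(changes, t0, hours=4))
```

Taking the hourly maximum is conservative for bidding: a bid at or above the re-sampled value would not have been out-bid at any time within that hour.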

Evaluated algorithms In this paper, we use 5 algorithms as comparison methods: Linear Regression (LR) [22], Support Vector Machine Regression (SVR), Random Forest (RF) [25], Multi-layer Perceptron Regression (MLPR) [23] and gcForest [28].

Performance metrics Mean Absolute Percentage Error (MAPE) is a commonly used measurement method for time series forecasting problems. It can measure the outcome of a predictive model. MAPE is defined as follows:

$$\begin{aligned} MAPE=\frac{1}{n} \sum _{i=1}^{n}APE_i \end{aligned}$$
(6)

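where n is the number of samples in the test set, and \(APE_i\) is the absolute percentage error of the i-th test sample, averaged over the \(l_{pt}\) predicted values. With \(y_i^j\) and \(\hat{y}_i^j\) denoting the j-th components of the true and predicted label vectors of sample i, it is defined as follows:

$$\begin{aligned} APE_i=\frac{1}{l_{pt}}\sum _{j=1}^{l_{pt}}\frac{|y_i^j-\hat{y}_i^j|}{y_i^j}\times 100\% \end{aligned}$$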
(7)

In this paper, to make a quantitative estimation, we use \(MAPE_{m\%}\) as performance metrics. \(MAPE_{m\%}\) represents the number of results whose APE is less than or equal to \(m\%\) as a percentage of the number of total results, which is calculated as follows:

$$\begin{aligned} MAPE_{m\%}=\frac{1}{n}\sum _{i=1}^{n} \mathbf{1} (APE_i-m\%), \mathbf{1} (x)=\left\{ \begin{aligned} 1\quad x\le 0 \\ 0\quad x>0 \\ \end{aligned} \right. \end{aligned}$$
(8)

In this paper, the selected value of m is 5.
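The three quantities of Eqs. 6 to 8 can be computed as in the sketch below. The function names and the sample values are hypothetical, and \(MAPE_{m\%}\) is returned as a percentage (the fraction of Eq. 8 multiplied by 100), which is how the results are reported.

```python
def ape(y_true, y_pred):
    """Absolute percentage error of one predicted window, in percent (Eq. 7)."""
    terms = [abs(t - p) / t for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(terms) / len(terms)


def mape(windows):
    """Mean APE over all test windows (Eq. 6); windows holds (y_true, y_pred)."""
    return sum(ape(t, p) for t, p in windows) / len(windows)


def mape_m(windows, m=5.0):
    """Percentage of test windows whose APE is at most m percent (Eq. 8)."""
    hits = sum(1 for t, p in windows if ape(t, p) <= m)
    return 100.0 * hits / len(windows)


# Two hypothetical test windows with l_pt = 2: (true prices, predicted prices).
windows = [([1.0, 2.0], [1.0, 2.1]),   # small error
           ([1.0, 2.0], [1.2, 2.4])]   # large error
print(mape(windows), mape_m(windows, m=5.0))
```

Unlike MAPE, which a few badly predicted windows can dominate, \(MAPE_{m\%}\) counts how many windows are predicted within the tolerance m.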

Experimental results and discussions

Currently, since many applications require a lot of time and money to run, deploying local applications to a cloud platform can reduce costs such as hardware purchasing, device cooling and hardware maintenance. For example, if a user deploys a web crawler system, the running time may be several hours or even days. Similarly, in video rendering, videos of different lengths need different amounts of time: short videos may take several to tens of hours, while longer videos may take several days. When migrating their applications to the cloud, users need to consider the application’s likely running time and estimate the price over this period in advance to help them bid successfully. Therefore, in this paper, we discuss spot instance price prediction in two scenarios: 1-day-ahead and 1-week-ahead.

One-day-ahead

Parameter setting In order to maximize the effect of the proposed model, we need to determine the value of k and the sliding window length \(l_{sw}\). We conduct a lot of experiments with different k and \(l_{sw}\) respectively. The experimental results are shown in Fig. 3.

Fig. 3 \(MAPE_{5\%}\) variations in different \(l_{sw}\) and k on 1-day-ahead

In this figure, \(k=1\), i.e. nearest neighbor regression, clearly gives the best result. As k increases, the estimation error increases, because training samples that are far from the input \(\varvec{X}\) degrade the prediction. So we set \(k=1\).

As the sliding window grows longer, the additional dimensions do not make the samples easier to distinguish and instead reduce the accuracy of the results. Fig. 3 shows that the best result is obtained when the length of the sliding window equals the length of the prediction window.

Thus we choose the length of sliding window \(l_{sw}=24\).

Experimental result Based on the above settings, we make a forecast for 1-day-ahead. The results are shown in Tables 3 and 4.

Table 3 Comparison of different methods for regions on 1-day-ahead
Table 4 Comparison of different methods for instances on 1-day-ahead

These two tables show that kNN regression achieves the best results. RF and gcForest are the best of the compared methods, with an average \(MAPE_{5\%}\) of \(85.61\%\) and \(83.86\%\), respectively, whereas kNN reaches \(94.00\%\), an improvement of roughly 8 to 10 percentage points. MLPR and SVR both perform poorly, which is consistent with [25].

One-week-ahead

Parameter setting Fig. 4 shows that, as in 1-day-ahead prediction, the best result is obtained when k is equal to 1 and the length of the sliding window is equal to the length of the prediction window. So we select \(k=1\) and \(l_{sw}=168\) for 1-week-ahead prediction.

Fig. 4 \(MAPE_{5\%}\) variations in different \(l_{sw}\) and k on 1-week-ahead

Table 5 Comparison of different methods for regions on 1-week-ahead
Table 6 Comparison of different methods for instances on 1-week-ahead

Experimental result Based on the above settings, we make a forecast for 1-week-ahead. The results are shown in Tables 5 and 6.

These two tables show that kNN regression achieves the best results except for instance I7. For I7, the best result is obtained by RF with \(99.93\%\), which is 0.15 percentage points higher than kNN’s \(99.78\%\). However, in terms of \(MAPE_{10\%}\), the results of kNN and RF are both \(100\%\). So on the whole, kNN is the best method. RF and gcForest are the best of the compared methods, with a \(MAPE_{5\%}\) of \(86.50\%\) and \(89.80\%\), respectively. In comparison, kNN reaches \(94.06\%\), an improvement of roughly 4 to 8 percentage points.

According to the spot instance mechanism, the end user is allowed to propose a bid. When the spot price is not higher than this bid, the user can use the instance; otherwise the bid fails. In addition, when the user is running a spot instance and the price rises above the bid, an out-of-bid event occurs and the user’s instance is forcibly interrupted. The fluctuation of the spot price therefore brings great inconvenience and difficulty to end users. By predicting the future price of spot instances, the proposed method helps end users in the following aspects: (a) setting a proper bid or designing a reasonable bidding scheme; (b) choosing the right time to purchase a spot instance; (c) selecting the most appropriate spot instance or a combination of instances; (d) avoiding out-of-bid events and instance unavailability.

Conclusions

Many cloud providers like Amazon provide users with three main types of instances: reserved instances, on-demand instances and spot instances. Compared with the others, spot instances have the lowest price, but their fluctuating price is an obstacle for users. Predicting the spot price in advance helps users learn the future price trend, so that they can arrange their purchase time reasonably and choose a suitable bid that avoids both the high cost caused by a high bid and the instance unavailability caused by a low bid. Predicting the spot instance price in advance is therefore both important and challenging. In this paper, we give a mathematical description of the spot instance price prediction problem and use the price history of Amazon EC2 spot instances to predict future prices by building a kNN regression model. To evaluate the performance of our model, we use 88 days of spot instance prices from 4 regions and 9 instances in our experiments, and compare our model with LR, SVR, RF, MLPR and gcForest. Evaluation results show that the \(MAPE_{5\%}\) reaches \(94.00\%\) in 1-day-ahead prediction and \(94.06\%\) in 1-week-ahead prediction. In both scenarios, our method achieves better performance than the other methods. The method proposed in this paper is also applicable to the spot instance price prediction of other cloud providers. Helping users to select appropriate instances [9,10,11] based on price prediction and combining cloud data storage [12] with cloud instance type selection are two directions for future work.