1 Introduction

Privacy-enhancing techniques and protocols for data aggregation and analytics in wireless networks require novel methods for efficient and privacy-preserving computation of distributed queries with the protection of outcomes from active attackers. Research on this topic belongs to the general area of distributed privacy-preserving data mining.

Previously, efficient specialised algorithms for distributed data mining and machine learning have been developed, for example, in [1,2,3,4]. On the other hand, the protection of privacy is also crucial for successful applications of novel cyber technologies [5,6,7]. For example, privacy-preserving techniques have been investigated for data publishing [8], service selection [9], and trend surface analysis [10].

In this context, the present paper investigates the general situation where it is necessary to ensure privacy and at the same time provide answers to statistical or numerical aggregate queries over a large distributed dataset, which is a union of several separate subsets such that the managers of the subsets are not allowed to disclose the content of their data or transfer their data to other entities or competing organisations operating in the same wireless network.

Denote by \({\mathscr {D}}\) the whole distributed dataset, which is a union of the subsets \({\mathscr {D}}_1, \dots , {\mathscr {D}}_M\) supervised by different managers \({\mathscr {M}}_1, \dots , {\mathscr {M}}_M\), where M is the number of the subsets. We assume that the dataset is horizontally distributed, i.e., each record belongs to one of the subsets \({\mathscr {D}}_1, \dots , {\mathscr {D}}_M\). For a positive integer n, we denote by [1 : n] the set \(\{1, 2, \dots , n\}\).

A client submits a query to the system being organised by the managers. The managers process the query following the protocol and return the final outcome to the client. The client does not have access to the records of the dataset, because they contain confidential information. For \(m \in [1: M]\), each manager \({\mathscr {M}}_m\) has access only to the records of the corresponding dataset \({\mathscr {D}}_m\), but has no right to access the datasets of the other managers.

The dataset managers processing distributed queries represent official established entities that own their parts of the data contained in the distributed dataset. As a typical important example, the managers may be official representatives of different organisations participating in the wireless network. Therefore, it is natural to assume that the individual dataset managers are honest. Nevertheless, they cannot share confidential information about individual entries in their part of the dataset with the representatives of other competing organisations.

An efficient noise addition framework for privacy-preserving data mining was proposed in [11,12,13,14]. An algorithm for the private processing of distributed queries was proposed in [15]; it was the first and only algorithm applicable in the situation considered in the present paper. However, the procedure proposed in [15] does not provide protection against active attackers, and so it cannot solve the problem considered here. The present paper also considers a larger class of queries than [15]. The readers are referred to [15] for additional examples of previous related publications.

We propose solutions to this problem by developing new protocols, which employ methods different from those used in [15]. Besides, the present paper handles a larger class of numerical queries than [15]: the vector functions considered in this paper are more general than the scalar functions used in [15].

In our model, the managers use separate servers for secure computation. It is likely that every participant would prefer to be equally involved in the process of secure computation. There are no reasons to introduce a single trusted authority handling one server for all secure computations. Instead, we assume that each manager introduces an individual server, to which all managers have to communicate intermediate values in order to organize the process. Accordingly, we assume that the managers \({\mathscr {M}}_1, \dots , {\mathscr {M}}_M\) introduce the servers \(S_1, \dots , S_M\), where the server \(S_m\) belongs to \({\mathscr {M}}_m\), for \(m \in [1: M]\).

However, the problem of protecting the outcomes of distributed queries from active attackers has not been considered in this setting before. This problem is paramount because the servers are likely to be targeted by active outsider attackers: the servers are new, each of them is involved in communication with the managers of all subsets, and each contains a lot of confidential information contributed by all the individual managers.

When the active attackers compromise several of the servers \(S_1, \dots , S_M\), this leads to occurrences of Byzantine faults in the compromised servers. Let k be a positive integer, \(k < M\), equal to the largest number of the servers \(S_1, \dots , S_M\) which can be compromised by active outsider attackers. This integer is an input parameter to our protocols.

The aim of this paper is to propose solutions to the problem of protecting the outcomes of distributed queries from active outsider attackers. We define a new large class of distributed queries to be handled by our protocols. A formal definition of this class is given in Sect. 3. This class contains many statistical queries of practical value. We propose solutions to the problem of protecting all queries from this class against active outsider attackers. We introduce two recursive protocols for the Protection against Active Attackers (PAA): PAA-SSS, which applies Shamir’s Secret Sharing, and PAA-HE, which applies homomorphic encryption. The latter combines the ElGamal and Paillier encryption schemes in order to handle certain steps of the whole system. Theoretical analysis and the results of our experiments show that (i) both protocols significantly outperform more straightforward approaches, and (ii) PAA-HE provides stronger protection to the query outcomes, but is slower than PAA-SSS.

The paper comprises the following sections. Section 2 recalls the ElGamal and Paillier encryption schemes and Shamir’s secret sharing. Section 3 explains the PAA-SSS and PAA-HE protocols. It introduces the class \({\mathscr {C}}\) of queries handled by the protocols, explains iterations and steps of the PAA-SSS and PAA-HE protocols, and shows that the class \({\mathscr {C}}\) contains many important statistical queries. The experiments comparing PAA-SSS and PAA-HE with other algorithms are presented in Sect. 4. Section 5 concludes the paper.

2 Preliminaries

2.1 ElGamal encryption scheme for \({\mathscr {M}}_1, \dots , {\mathscr {M}}_M\)

Following [16, Sect. 2.3], here we define succinct notation for the ElGamal encryption scheme introduced in [17]. Each manager \({\mathscr {M}}_\ell\), \(\ell \in [1: M]\), chooses a secret key \(\mathsf {sk}_\ell\) and a public key \({\mathsf {pk}}_\ell\), as explained in [16, Sect. 2.3]. Let us denote by \(c = \mathsf {E}_{EG}(t, \mathsf {pk}_\ell )\) the ElGamal encryption of a plaintext t. We denote by \(t = \mathsf {D}_{EG}(c, \mathsf {sk}_\ell )\) the ElGamal decryption of c.

For any plaintexts \(t_1, \dots , t_M\), the ElGamal cryptosystem satisfies the following homomorphic property:

$$\begin{aligned} \prod ^M_{m = 1} \mathsf {E}_{EG}(t_m, \mathsf {pk}_\ell ) = \mathsf {E}_{EG}\left( \prod ^M_{m = 1} t_m, \mathsf {pk}_\ell \right) . \end{aligned}$$
(1)

For more explanations and examples, the readers are referred to [16, Sect. 2.3].
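The multiplicative property (1) can be checked numerically with a toy instance. The following is a minimal sketch, not the set-up of [16]: the modulus `P`, the base `G` and the helper names `keygen`, `encrypt`, `decrypt` and `multiply` are assumptions chosen for illustration, and the parameters are far too small for real use.

```python
import secrets

# Toy parameters (assumptions for illustration only; insecure sizes).
P = 467   # small prime modulus
G = 2     # base element of the multiplicative group modulo P

def keygen():
    """Return a pair (sk, pk) with pk = G^sk mod P."""
    sk = secrets.randbelow(P - 2) + 1          # secret exponent in [1, P - 2]
    return sk, pow(G, sk, P)

def encrypt(t, pk):
    """ElGamal encryption of a plaintext t in [1, P - 1]."""
    r = secrets.randbelow(P - 2) + 1           # fresh randomness per ciphertext
    return pow(G, r, P), (t * pow(pk, r, P)) % P

def decrypt(c, sk):
    c1, c2 = c
    s = pow(c1, sk, P)                         # shared mask pk^r
    return (c2 * pow(s, -1, P)) % P            # modular inverse (Python 3.8+)

def multiply(ca, cb):
    """Componentwise product of two ciphertexts, as on the left of (1)."""
    return (ca[0] * cb[0]) % P, (ca[1] * cb[1]) % P

sk, pk = keygen()
plaintexts = [3, 5, 7]
product_ct = encrypt(1, pk)
for t in plaintexts:
    product_ct = multiply(product_ct, encrypt(t, pk))
elgamal_result = decrypt(product_ct, sk)       # product of the plaintexts mod P
```

Decrypting the product of the ciphertexts yields the product of the plaintexts modulo `P`, which is the content of (1) for one key pair.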

2.2 Paillier encryption scheme for \({\mathscr {M}}_1, \dots , {\mathscr {M}}_M\)

Following [16, Sect. 2.4], we introduce concise notation for the Paillier encryption scheme introduced in [18]. Each manager \({\mathscr {M}}_\ell\), \(\ell \in [1: M]\), chooses a secret key \(\mathsf {sk}'_\ell\) and a public key \(\mathsf {pk}'_\ell\), as explained in [16, Sect. 2.4]. We denote by \(c = \mathsf {E}_{P}(t, \mathsf {pk}'_\ell )\) the Paillier encryption of the plaintext t. Let us denote by \(t = \mathsf {D}_{P}(c, \mathsf {sk}'_\ell )\) the Paillier decryption of c.

For any plaintexts \(t_1, \dots , t_M\), the Paillier encryption scheme satisfies the following homomorphic property:

$$\begin{aligned} \prod ^M_{m = 1} \mathsf {E}_{P}(t_m, \mathsf {pk}'_\ell ) = \mathsf {E}_{P}\left( \sum ^M_{m = 1} t_m, \mathsf {pk}'_\ell \right) . \end{aligned}$$
(2)

For more details and examples, the readers are referred to [16, Sect. 2.4].
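The additive property of the Paillier scheme, where a product of ciphertexts decrypts to a sum of plaintexts, can also be checked on a toy instance. This is a minimal sketch under stated assumptions: the primes 293 and 433, the common simplification of using the generator n + 1, and the helper names are ours and are not taken from [16] or [18].

```python
import math
import secrets

def paillier_keygen(p=293, q=433):
    """Toy key generation (assumed small primes; insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)          # valid for the generator n + 1 (Python 3.8+)
    return n, (lam, mu)

def paillier_encrypt(t, n):
    n2 = n * n
    while True:                   # pick r coprime to n
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^t = 1 + t * n modulo n^2
    return ((1 + t * n) * pow(r, n, n2)) % n2

def paillier_decrypt(c, n, sk):
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

n, sk_p = paillier_keygen()
values = [10, 20, 30]
product_ct = 1
for t in values:
    product_ct = (product_ct * paillier_encrypt(t, n)) % (n * n)
paillier_result = paillier_decrypt(product_ct, n, sk_p)   # sum of the plaintexts mod n
```

The decrypted value equals the sum of the plaintexts modulo n, so a server that only multiplies ciphertexts never sees the individual summands.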

2.3 Shamir’s secret sharing for \({\mathscr {M}}_1, \dots , {\mathscr {M}}_M\)

To apply Shamir’s secret sharing [19] as explained in [20], the managers choose a finite field \({\mathscr {F}}\) with \(|{\mathscr {F}}| > M\) and with a primitive M-th root of unity, \(\alpha \in {\mathscr {F}}\), \(\alpha ^M = 1\). All values of the data are represented as elements of \({\mathscr {F}}\). They compute \(\xi _1 = \alpha ^0\), \(\xi _2 = \alpha ^1, \dots ,\) \(\xi _M = \alpha ^{M - 1}\) in \({\mathscr {F}}\).

Suppose that each manager \({\mathscr {M}}_m\), \(m \in [1: M]\), has a secret value \(y_m\). To introduce it to the process, Shamir’s secret sharing [19] is used as follows. Recall that the smallest integer that is greater than or equal to x is denoted by \(\lceil x \rceil\). Putting \(k = \lceil M/2 \rceil - 1\), the manager \({\mathscr {M}}_m\) selects k random elements \(u_{m, 1}, \dots , u_{m, k} \in {\mathscr {F}}\), defines the polynomial \(g_m(x) = y_m + u_{m,1} x + \cdots + u_{m, k} x^k\), and for all \(m' \in [1: M]\) sends the value \(\Omega _{m'}(y_m) = g_m(\xi _{m'})\) as a secret share to \({\mathscr {M}}_{m'}\). The secret value \(y_m\) has thus been split into the secret shares

$$\begin{aligned} \Omega _1(y_m) = g_m(\xi _{1}), \dots , \Omega _M(y_m) = g_m(\xi _{M}). \end{aligned}$$
(3)

These shares encode \(y_m\), because Lagrange’s interpolation formula

$$\begin{aligned} g_m(x)&= \sum ^{k+1}_{m' = 1} \left( \Omega _{m'}(y_m) \frac{ \prod _{ m'' \in [1: k+1], m'' \ne m' } (x - \xi _{m''}) }{ \prod _{ m'' \in [1: k+1], m'' \ne m' } (\xi _{m'}- \xi _{m''}) } \right) \end{aligned}$$
(4)

restores the polynomial \(g_m(x)\) from (3) and recovers \(y_m = g_m(0)\).

It is explained in [20] how each manager \({\mathscr {M}}_m\), \(m \in [1: M]\), can privately compute two new values \(\Theta _m(y_1, \dots , y_M)\) and \(\Delta _m(y_1, \dots , y_M)\), which encode the sum \(\sum ^M_{m' = 1} y_{m'}\) and the product \(\prod ^M_{m' = 1} y_{m'}\) as their corresponding secret shares, respectively. In order to refer to these values, we denote them by

$$\begin{aligned} \Theta _m(y_1, \dots , y_M)&= \Omega _m \left( \sum ^M_{m' = 1} y_{m'}\right) , \end{aligned}$$
(5)
$$\begin{aligned} \Delta _m(y_1, \dots , y_M)&= \Omega _m \left( \prod ^M_{m' = 1} y_{m'}\right) . \end{aligned}$$
(6)
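The sharing step (3), the recovery of a secret by Lagrange interpolation, and the sum encoding (5) can be illustrated over a small prime field. This is a sketch under assumptions of ours: the field of integers modulo 11, M = 5, the element 3 of order 5, and the function names `share` and `reconstruct` are illustrative. The product encoding (6) needs the additional degree-reduction procedure of [20] and is not shown.

```python
import secrets

P = 11                                       # toy prime field (assumption); M divides P - 1
M = 5                                        # number of managers
ALPHA = 3                                    # has multiplicative order 5 modulo 11
K = -(-M // 2) - 1                           # ceil(M / 2) - 1
XI = [pow(ALPHA, i, P) for i in range(M)]    # xi_1, ..., xi_M

def share(y):
    """Return the M shares g(xi_1), ..., g(xi_M) of a random degree-K polynomial with g(0) = y."""
    coeffs = [y % P] + [secrets.randbelow(P) for _ in range(K)]
    return [sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P for x in XI]

def reconstruct(points):
    """Lagrange interpolation at x = 0 from K + 1 pairs (xi, share)."""
    secret = 0
    for i, (xi, si) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (0 - xj) % P
                den = den * (xi - xj) % P
        secret = (secret + si * num * pow(den, -1, P)) % P
    return secret

secret_values = [2, 3, 1, 0, 4]                       # y_1, ..., y_M
all_shares = [share(y) for y in secret_values]        # all_shares[m][j] is the share of y_m at xi_j
# Adding the shares held at each point gives shares of the sum, as in (5).
theta = [sum(all_shares[m][j] for m in range(M)) % P for j in range(M)]
total = reconstruct(list(zip(XI, theta))[:K + 1])
single = reconstruct(list(zip(XI, all_shares[0]))[-(K + 1):])
```

Any K + 1 of the M points suffice for recovery, which is why the last line can use a different subset of shares than the line before it.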

3 The PAA protocols

Table 1 Main notation used in this paper

We consider the general case in which the client communicates to the managers of the distributed dataset \({\mathscr {D}}\) a query of the form \((\varphi , B)\), where B is a finite set of Boolean expressions indicated by the client for choosing vectors to be included in the query, and \(\varphi\) is a vector function selected by the client for computing the query outcome. The main notations used in our protocols are listed in Table 1.

The queries handled by the PAA-SSS and PAA-HE protocols incorporate Boolean expressions B. They are indicated by the client and used to select subsets of records or vectors in \({\mathscr {D}}\). Each of these expressions can be applied to any vector \({\mathbf {v}}\) in \({\mathscr {D}}\) and produces TRUE or FALSE for each vector.

Denote by \({\mathscr {B}}\) the class of all Boolean expressions considered in this paper. The class \({\mathscr {B}}\) is defined recursively as follows. First, \({\mathscr {B}}\) contains all Boolean basic expressions of the form \(\psi ({\mathbf {v}}) = \varphi ({\mathbf {v}})\), \(\psi ({\mathbf {v}}) < \varphi ({\mathbf {v}})\), \(\psi ({\mathbf {v}}) \le \varphi ({\mathbf {v}})\), \(\psi ({\mathbf {v}}) > \varphi ({\mathbf {v}})\), \(\psi ({\mathbf {v}}) \ge \varphi ({\mathbf {v}})\), where \(\psi\) and \(\varphi\) are any numerical functions defined for any vector \({\mathbf {v}}\) in \({\mathscr {D}}\). Second, if \(B_1, B_2 \in {\mathscr {B}}\), then the following expressions also belong to \({\mathscr {B}}\): \(\lnot B_1\), \(B_1 \wedge B_2\), \(B_1 \vee B_2\), \(B_1 \mid B_2\), \(B_1 \rightarrow B_2\), \(B_1 \leftrightarrow B_2\) where \(\lnot\) (NOT), \(\wedge\) (AND), \(\vee\) (OR), \(\mid\) (XOR), \(\rightarrow\) (implication), and \(\leftrightarrow\) (equivalence) are the well-known Boolean operators. These two rules recursively define all Boolean expressions in the class \({\mathscr {B}}\).

Given a finite set \(B \subseteq {\mathscr {B}}\) specified by the client, let us denote by \(T = B({\mathscr {D}})\) the set of records or vectors selected in the dataset \({\mathscr {D}}\) by the set B of Boolean expressions. The set T consists of all vectors \({\mathbf {v}} \in {\mathscr {D}}\) such that \(B_1({\mathbf {v}}) = \text{ TRUE }\) for all \(B_1 \in B\).

It is easy for the managers to apply the Boolean expressions in the finite set B to their separate datasets, because it follows from the recursive definition given above that every Boolean expression applies to each vector of the dataset considered individually in isolation from other vectors. Therefore, every manager \({\mathscr {M}}_m\), \(m \in [1: M]\), can select all vectors of the corresponding subset \({\mathscr {D}}_m\) locally without consulting any other manager. For \(m \in [1: M]\), denote by \(T_m = B({\mathscr {D}}_m)\) the subset consisting of all vectors in \({\mathscr {D}}_m\) satisfying all Boolean expressions in the finite set B. The set \(T_m\) consists of all vectors \({\mathbf {v}} \in {\mathscr {D}}_m\) such that \(B_1({\mathbf {v}}) = \text{ TRUE }\) for all \(B_1 \in B\). Let \(R = |T|\) be the cardinality of the set T, and let \(R_m = |T_m|\) be the number of vectors in \(T_m\). Then \(R = \sum ^M_{m = 1} R_m\) and \(T = T_1 \dot{\cup } T_2 \dot{\cup } \dots \dot{\cup } T_M\) is a disjoint union of the sets \(T_1, T_2, \dots , T_M\).
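The local selection step described above can be sketched as follows. The records, the two Boolean expressions and the name `select` are hypothetical examples of ours; the only point illustrated is that each manager filters its own subset in isolation and that the selected sets are disjoint.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
BoolExpr = Callable[[Vector], bool]

def select(subset: List[Vector], B: List[BoolExpr]) -> List[Vector]:
    """Return T_m: the vectors of one subset that satisfy every expression in B."""
    return [v for v in subset if all(b(v) for b in B)]

# Hypothetical expressions built from comparisons and Boolean operators.
b1: BoolExpr = lambda v: v[0] >= 18                       # basic comparison
b2: BoolExpr = lambda v: (v[1] > 100) or not (v[2] == 0)  # OR combined with NOT
B = [b1, b2]

# Hypothetical horizontally distributed dataset with M = 2 managers.
D: Dict[int, List[Vector]] = {
    1: [(25, 150, 0), (17, 300, 1)],
    2: [(40, 50, 0), (30, 80, 2)],
}

T_parts = {m: select(D_m, B) for m, D_m in D.items()}     # computed locally by each manager
R_parts = {m: len(T_m) for m, T_m in T_parts.items()}     # R_m
R = sum(R_parts.values())                                 # R is the sum of the R_m
```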

Denote all vectors in the set \(T = B({\mathscr {D}})\) by \({\mathbf {v}}_1, \dots , {\mathbf {v}}_R\). Let C be the number of components or coordinates in every vector \({\mathbf {v}}\) of the whole dataset \({\mathscr {D}}\). Denote the components of the vector \({\mathbf {v}} \in {\mathscr {D}}\) by \({v}_1\), \({v}_2, \dots , {v}_C\). This means that \({\mathbf {v}} = (v_1, v_2, \dots , v_C) \in {\mathscr {D}}\). For \(r \in {[1: R]}\) and \({\mathbf {v}}_r \in T\), denote the components of the vector \({\mathbf {v}}_r\) by \(v_{r, 1}, \dots , v_{r, C}\). Then we have \({\mathbf {v}}_r = ({v}_{r,1}, {v}_{r,2}, \dots , {v}_{r, C}) \in T\).

Intuitively, the class \({\mathscr {C}}\) of queries handled by our protocols consists of all pairs \((\varphi , B)\), where B is a finite set of Boolean expressions, and where \(\varphi\) is any vector function, which applies to the set \(T = B({\mathscr {D}})\) and which can be defined by using some vector functions of individual vectors in T, the symbols of sum \(\sum ^{R}_{r=1}\) and product \(\prod ^{R}_{r=1}\), as well as a compound vector function combining them.

More formally, the class \({\mathscr {C}}\) is defined as the set of all pairs \((\varphi , B)\), where B is a finite set of Boolean expressions and \(\varphi\) is a vector function defined as follows. Take any positive integers L and d. For \(\ell \in [1: L]\), let \(d_\ell\), \(e_\ell\) be positive integers. Let \(\varphi _1\) be a vector function from \({\mathbb {R}}^C\) to \({\mathbb {R}}^{d_1}\). Let \(\psi _1\) be a vector function from \({\mathbb {R}}^C\) to \({\mathbb {R}}^{e_1}\). Finally, let \(\psi\) be a function from \({\mathbb {R}}^{d_L + e_L}\) to \({\mathbb {R}}^{d}\). The function \(\varphi\) is defined recursively by the following equalities

$$\begin{aligned} \varphi (T)&= \psi \left( \sum ^R_{r=1} \varphi _{\ell }, \prod ^R_{r=1} \psi _{\ell } \right) , \end{aligned}$$
(7)
$$\begin{aligned} \varphi _\ell&= \varphi _\ell \left( {\mathbf {v}}_r, \sum ^R_{r=1} \varphi _{\ell -1}, \prod ^R_{r=1} \psi _{\ell -1} \right) , \end{aligned}$$
(8)
$$\begin{aligned} \psi _\ell&= \psi _\ell \left( {\mathbf {v}}_r, \sum ^R_{r=1} \varphi _{\ell -1}, \prod ^R_{r=1} \psi _{\ell -1} \right) , \end{aligned}$$
(9)

where, for \(\ell \in [2: L]\), \(\varphi _\ell\) is a function from \({\mathbb {R}}^{C + d_{\ell -1} + e_{\ell -1}}\) to \({\mathbb {R}}^{d_\ell }\), and \(\psi _\ell\) is a function from \({\mathbb {R}}^{C + d_{\ell -1} + e_{\ell -1}}\) to \({\mathbb {R}}^{e_\ell }\).
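The recursion in (7) to (9) can be read as a loop over levels, where each level aggregates a componentwise sum and a componentwise product over all selected vectors. The following plaintext sketch ignores all privacy mechanisms; the function `evaluate` and the example query (the mean absolute deviation, computed with L = 2) are illustrations of ours and are not taken from the paper.

```python
from math import prod

def evaluate(T, phis, psis, psi):
    """Plaintext evaluation of a query: phis[l] and psis[l] play the roles of phi_{l+1}, psi_{l+1}."""
    w, w_prime = (0,), (0,)                     # initial values used at the first level
    for phi_l, psi_l in zip(phis, psis):
        phi_vals = [phi_l(v, w, w_prime) for v in T]
        psi_vals = [psi_l(v, w, w_prime) for v in T]
        w = tuple(sum(col) for col in zip(*phi_vals))         # componentwise sum over r
        w_prime = tuple(prod(col) for col in zip(*psi_vals))  # componentwise product over r
    return psi(w, w_prime)

# Example with L = 2: level 1 gathers the sum and the count, level 2 uses the mean.
phis = [
    lambda v, w, wp: (v[0], 1),
    lambda v, w, wp: (abs(v[0] - w[0] / w[1]), 1),
]
psis = [
    lambda v, w, wp: (1,),                      # the product part is unused in this example
    lambda v, w, wp: (1,),
]
final = lambda w, wp: w[0] / w[1]

T = [(1.0,), (2.0,), (6.0,)]
mad = evaluate(T, phis, psis, final)            # mean absolute deviation about the mean
```

The second level cannot be evaluated before the first-level sum is known, which is the reason the definition is recursive in the level index.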

The class \({\mathscr {C}}\) contains many useful and well-known numerical queries. Indeed, the following smaller class \({\mathscr {K}}\) was defined in [15]. It consists of all functions \(\varphi (T)\) given by

$$\begin{aligned} \varphi (T)&= g \left( \sum ^{R}_{r=1} g_1({\mathbf {v}}_r), \dots , \sum ^{R}_{r=1} g_{d_1}({\mathbf {v}}_r) \right) , \end{aligned}$$
(10)

where g is a function with \(d_1\) arguments and \(g_1, \dots , g_{d_1}\) are scalar functions with C arguments each. If we put \(L = 1\), \(\psi = g\) and \(\varphi _1({\mathbf {v}}_r) = (g_1({\mathbf {v}}_r), \dots , g_{d_1}({\mathbf {v}}_r))\), then we get \(\varphi (T) = \psi \Bigl ( \sum ^{R}_{r=1} \varphi _1({\mathbf {v}}_r) \Bigr )\), which is a special case of (7). Therefore \({\mathscr {K}} \subseteq {\mathscr {C}}\). It was explained in [15] that \({\mathscr {K}}\) contains the mean, variance, standard deviation, the coefficient of variation, the sample covariance, and the Pearson product-moment correlation coefficient (cf. [21]). It follows that all these functions also belong to \({\mathscr {C}}\).
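As a worked instance of (10), the mean and the variance can be obtained from three sums. The sketch below is a plaintext illustration with hypothetical data; the choice of the population variance, the functions `g_parts` and `g`, and the split into three subsets are assumptions of ours.

```python
# Hypothetical selected vectors T_1, T_2, T_3 held by three managers (C = 1).
T_parts = [
    [(2.0,), (4.0,)],
    [(4.0,), (4.0,), (5.0,)],
    [(5.0,), (7.0,), (9.0,)],
]

def g_parts(v):
    """The scalar functions g_1, g_2, g_3 applied to one vector: 1, v_1 and v_1 squared."""
    return (1.0, v[0], v[0] ** 2)

# Each manager computes its partial sums locally.
partial = [
    tuple(sum(col) for col in zip(*(g_parts(v) for v in T_m)))
    for T_m in T_parts
]
# Adding the partial sums gives the three sums over all R vectors.
count, s1, s2 = (sum(col) for col in zip(*partial))

def g(count, s1, s2):
    """Outer function: returns (mean, population variance)."""
    mean = s1 / count
    return mean, s2 / count - mean ** 2

mean, variance = g(count, s1, s2)
```

Only the three aggregated sums enter the outer function, so no individual vector has to leave its manager in order to evaluate the query.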

Another example of a query in \({\mathscr {C}}\) is given by any pair \((\varphi , B)\), where \(\varphi\) is the geometric mean defined by

$$\begin{aligned} GM(v_{1,1}, \dots , v_{R,1})= & {} \root R \of {\prod ^{R}_{r=1} v_{r,1}}. \end{aligned}$$
(11)

It belongs to \({\mathscr {C}}\), since it is determined by (7) with \(L = 1\), \(\varphi _1({\mathbf {v}}_r) = 1\), \(\psi _1({\mathbf {v}}_r) = v_{r,1}\), \(\psi (x, y) = \root x \of {y}\). Clearly, \(GM(v_{1,1}, \dots , v_{R,1})\) does not belong to \({\mathscr {K}}\), because it requires a product over the vectors of T. Therefore, \({\mathscr {K}}\) is strictly included in \({\mathscr {C}}\).
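The decomposition of the geometric mean into a sum part and a product part can be traced on a small plaintext example. The data and the split into two subsets are hypothetical; the names `y`, `y_prime`, `w` and `w_prime` are chosen by us for the local subsums, the local subproducts and their global counterparts.

```python
from math import prod

# Hypothetical selected vectors of two managers; only the first component is used.
T_parts = [[(2.0,)], [(4.0,), (8.0,)]]

y = [sum(1 for _ in T_m) for T_m in T_parts]          # local sums of the constant 1
y_prime = [prod(v[0] for v in T_m) for T_m in T_parts]  # local products of v_{r,1}

w = sum(y)               # equals R, the number of selected vectors
w_prime = prod(y_prime)  # product of v_{r,1} over all selected vectors

gm = w_prime ** (1.0 / w)   # the R-th root of the product
```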

3.1 The PAA-SSS protocol

All steps of the PAA-SSS protocol are formally described in Algorithm 1. Let us introduce concise auxiliary notation used in Algorithm 1.

Algorithm 1 The PAA-SSS protocol

For \(m \in [1: M+1]\), put \(\gamma _m = \sum ^{m - 1}_{m' = 1} R_{m'}\). In particular, \(\gamma _1 = 0\). Without loss of generality, we may assume that all vectors of T are indexed so that the vectors of \(T_1\) are indexed first and occur in succession one after another, then the vectors of \(T_2\) follow, and so on. It follows that we can denote all vectors of \(T_m\) by \({\mathbf {v}}_{\gamma _m + 1}, \dots , {\mathbf {v}}_{\gamma _m + R_m}\). Then we get

$$\begin{aligned} T_m= & {} \{ {\mathbf {v}}_{\gamma _m + 1}, {\mathbf {v}}_{\gamma _m + 2}, \dots , {\mathbf {v}}_{\gamma _m + R_m} \}. \end{aligned}$$
(12)
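The offsets and the index ranges in (12) are prefix sums of the subset sizes. A minimal sketch with hypothetical sizes follows; the list names are ours.

```python
from itertools import accumulate

R_sizes = [3, 2, 4]                          # hypothetical R_1, R_2, R_3

# gamma[m - 1] holds the offset for m = 1, ..., M + 1; the first offset is 0.
gamma = [0] + list(accumulate(R_sizes))

# 1-based indices of the vectors of T_m, following (12).
index_ranges = [
    list(range(gamma[m] + 1, gamma[m] + R_sizes[m] + 1))
    for m in range(len(R_sizes))
]
```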

To be able to treat \(\varphi _1\), \(\psi _1\) in the same way as the other functions in the iterations of Algorithms 1 and 2, let us define \(\varphi _1({\mathbf {v}}_r, 0, 0) = \varphi _1({\mathbf {v}}_r)\) and \(\psi _1({\mathbf {v}}_r, 0, 0) = \psi _1({\mathbf {v}}_r)\). In the beginning of Algorithm 1, the managers set up Shamir’s Secret Sharing scheme as indicated in line 1 and as explained in Sect. 2.3. Line 2 of Algorithm 1 sets up the initial values required to start the iterations of the loop in lines 3 to 10. Each iteration of the loop assumes that, for \(\ell \in [1: L]\), the values \(w_{\ell - 1}\) and \(w'_{\ell - 1}\) have already been determined in the previous iteration or in line 2.

In line 4, each manager locally computes the auxiliary subsum \(y_m\) and uses Shamir’s Secret Sharing to send its private share \(z_{m, m'}\) to the server \(S_{m'}\), for all \(m' \in [1: M]\). In line 5 of Algorithm 1, the shares \(t_{\ell , 1}, \dots , t_{\ell , M}\) encode the sum \(w_{\ell } = \sum ^R_{r = 1} \varphi _\ell ({\mathbf {v}}_r, w_{\ell -1}, w'_{\ell -1})\) in (7), as explained in Sect. 2.3. In line 6, each manager \({\mathscr {M}}_m\) recovers the sum \(w_{\ell }\), which is a part of (7).

In line 7, each manager locally computes the subproduct \(y'_m\) and sends its private shares \(z'_{m, m'}\) to the servers \(S_{m'}\), for all \(m' \in [1: M]\). In line 8 of Algorithm 1, the shares \(t'_{\ell , 1}, \dots , t'_{\ell , M}\) encode the product \(w'_{\ell } = \prod ^R_{r = 1} \psi _\ell ({\mathbf {v}}_r, w_{\ell -1}, w'_{\ell -1})\) in (7), as explained in Sect. 2.3. In line 9, each manager \({\mathscr {M}}_m\) recovers the product \(w'_{\ell }\), which is a part of (7).

It follows from (7) that \(\psi (w_L, w'_{L}) = \varphi (B({\mathscr {D}}))\). The managers compute \(\psi (w_L, w'_{L})\) locally and send it to the client in line 11 of Algorithm 1.

3.2 The PAA-HE protocol

The PAA-HE protocol is described in Algorithm 2. It combines the ElGamal and Paillier encryption schemes in one system. The managers set up their ElGamal and Paillier encryption schemes in line 1 of Algorithm 2. Each manager \({\mathscr {M}}_m\) sends its public ElGamal key \(\mathsf {pk}_m\) and its public Paillier key \(\mathsf {pk}'_m\) to all other managers.

Algorithm 2 The PAA-HE protocol

Line 3 of Algorithm 2 initializes the values \(w_0 = w'_0 = 0\), required for iterations of the loop in lines 4 to 13. Each iteration of the loop assumes that, for \(\ell \in [1: L]\), the values \(w_{\ell - 1}\) and \(w'_{\ell - 1}\) have already been determined in the previous iteration or in line 3. Each manager locally computes the subsum \(y_m\) in line 5 and, in line 6, uses the Paillier encryption to encrypt it and to send the encryption \(e_{m, m'}\) to the corresponding server \(S_{m'}\), for all \(m' \in [1: M]\). The server \(S_m\) computes the product \(p_m = \prod ^M_{m'=1} e_{m', m}\) and sends it to \({\mathscr {M}}_m\) in line 7. It follows from the homomorphic property (2) that \(p_m\) is an encryption of the sum \(w_{\ell } = \sum ^R_{r = 1} \varphi _\ell ({\mathbf {v}}_r, w_{\ell -1}, w'_{\ell -1})\) in (7). Every manager \({\mathscr {M}}_m\) uses the Paillier scheme to decrypt \(p_m\) and obtain \(w_{\ell }\) in line 8.

Each manager locally computes the subproduct \(y'_m\) in line 9 of Algorithm 2. In line 10, the manager uses the ElGamal encryption scheme to encrypt \(y'_m\) using all public keys \(\mathsf {pk}_{m'}\) and to send the encryption \(e'_{m, m'}\) to the corresponding server \(S_{m'}\), for all \(m' \in [1: M]\). The server \(S_m\) computes the product \(p'_m = \prod ^M_{m'=1} e'_{m', m}\) and sends it to \({\mathscr {M}}_m\) in line 11. It follows from the homomorphic property (1) that \(p'_m\) is an encryption of the product \(w'_{\ell } = \prod ^R_{r = 1} \psi _\ell ({\mathbf {v}}_r, w_{\ell -1}, w'_{\ell -1})\) in (7). Each manager \({\mathscr {M}}_m\) uses the ElGamal scheme to decrypt \(p'_m\) and obtain \(w'_{\ell }\) in line 12. It follows from (7) that \(\psi (w_L, w'_{L}) = \varphi (B({\mathscr {D}}))\). Each manager computes \(\psi (w_L, w'_{L})\) locally and sends it to the client in line 14 of Algorithm 2.

3.3 Theoretical analysis

For comparison, we include a direct application of the ElGamal and Paillier cryptosystems denoted by EGP. It transfers the required fields of all data vectors to the new servers in encrypted form and uses the homomorphic properties to perform the addition and multiplication of encrypted values without revealing their contents. The computation and communication complexities of PAA-SSS, PAA-HE, and EGP are presented in Table 2.

Table 2 Computation and communication complexities of the protocols

The security model considered in the present article assumes that all the managers \({\mathscr {M}}_1, \dots , {\mathscr {M}}_M\) are honest, but may be curious. This is a natural assumption, because the managers represent official organisations, which are not anonymous.

The servers \(S_1, \dots , S_M\) are new and are intended to process a lot of confidential information. They are likely to be targeted by active outsider attackers. This is why our security model includes active outsider attackers capable of compromising some of the servers \(S_1, \dots , S_M\). This means that Byzantine faults may take place in the operation of the servers \(S_1, \dots , S_M\): the faulty servers try to hide the fact that they have been compromised, but may output incorrect results of their calculations.

To concentrate on solving the new problem addressed in this paper, we assume that the communication between all participants is secure.

Theorem 1

The PAA-SSS protocol produces correct answers to distributed queries of the class \({\mathscr {C}}\) if the active outsider attackers have compromised at most \(\lceil M/3 \rceil - 1\) of the servers \(S_1, \dots , S_M\). The PAA-HE protocol produces correct answers to distributed queries of the class \({\mathscr {C}}\) if the active outsider attackers have compromised at most \(\lceil M/2 \rceil - 1\) of the servers \(S_1, \dots , S_M\). Moreover, in both of these cases, the active outsider attackers cannot derive confidential information of individual managers by combining the data received from the compromised servers during the execution of each protocol.

Proof

First, suppose that the active outsider attackers have compromised at most \(\lceil M/3 \rceil - 1\) of the servers \(S_1, \dots , S_M\). We are going to prove that the PAA-SSS protocol produces correct answers to distributed queries from the class \({\mathscr {C}}\), and that the active outsider attackers cannot derive confidential information of individual managers by combining the data available to them from the corresponding compromised servers.

Denote the number of the compromised servers \(S_1, \dots , S_M\) by \(k \le \lceil M/3 \rceil - 1\). It suffices to complete the proof in the most difficult case, where k is the largest integer with \(k \le \lceil M / 3 \rceil - 1\). To simplify notation, we assume that \(M = 3 k + 1\).

It follows that in the set of M private shares \(t_{\ell , m} = \Theta _m( z_{1, m}, \dots , z_{M, m} )\), for \(m \in [1 : M]\), calculated in line 5 of Algorithm 1, at most k of these private shares may be compromised.

In line 6 of Algorithm 1, the manager \({\mathscr {M}}_m\) uses (4) to recover \(w_{\ell }\) from

$$\begin{aligned} t_{\ell , 1}, \dots , t_{\ell , M}. \end{aligned}$$
(13)

At most k of (13) may be incorrect. Since \(t_{\ell , m} = \Theta _m(z_{1, m}, \dots , z_{M, m})\), it follows from (5) that the secret shares encoding \(w_\ell\) coincide with (13), which are equal to the values

$$\begin{aligned} f(\alpha ^0), f(\alpha ^1), \dots , f(\alpha ^{M - 1}), \end{aligned}$$
(14)

of the polynomial \(f(x) = w_\ell + a_1x + \cdots + a_{k}x^{k}\) encoding \(w_\ell\), as explained in Sect. 2.3.

Defining \(a_0 = w_\ell\) and \(a_{k + 1} = a_{k + 2} = \cdots = a_{M - 1} = 0\), we get the sequence

$$\begin{aligned} a_0, a_1, \dots , a_{M - 1}. \end{aligned}$$
(15)

Equation (5.182) in [22, Sect. 5.8.9] shows that (14) is the Discrete Fourier Transform of (15). The formula for the inverse Discrete Fourier Transform (Equation (5.184) in [22, Sect. 5.8.9]) implies that \(a_i = \frac{1}{M} \widehat{f}(\alpha ^{-i})\), for all \(i \in [0: M-1]\), where

$$\begin{aligned} \widehat{f}(x)&= \Omega _1(w_\ell ) + \Omega _2(w_\ell ) x + \cdots + \Omega _M(w_\ell ) x^{M-1}. \end{aligned}$$
(16)

Therefore,

$$\begin{aligned} \widehat{f}(\alpha ^{-i}) = 0, \text{ for } i \in [k+1: M-1]. \end{aligned}$$
(17)

Since \(\alpha ^M = 1\), we get \(\alpha ^{-i} = \alpha ^{M - i}\) for \(i \in [0: M-1]\). If we substitute (16) into (17), then we get

$$\begin{aligned} \sum ^{M-1}_{i=0} \alpha ^{r \cdot i} \cdot \Omega _{i+1}(w_\ell ) = 0 \text{ for } r \in [1:2k]. \end{aligned}$$
(18)

It follows that

$$\begin{aligned} \alpha , \alpha ^2, \dots , \alpha ^{2k} \end{aligned}$$
(19)

are the roots of the polynomial (16). Since \(\alpha ^M = 1\) and \(M = 3k+1\), the set (19) is equal to the set

$$\begin{aligned} \alpha ^{k+1}, \alpha ^{k+2}, \dots , \alpha ^{M-1}. \end{aligned}$$
(20)

Denote by \(M_{\alpha ^i}(x)\) the minimal polynomial of \(\alpha ^i\). The above conditions prove that the polynomial

$$\begin{aligned} g(x)&= \text{ lcm } \{M_{\alpha ^{k+1}}(x), \dots , M_{\alpha ^{M-1}}(x)\} \end{aligned}$$
(21)

divides (16). It follows that (13) is a codeword in the cyclic code of length M generated by g(x). This means that this code is the BCH code of designed distance \(2k+1\) (see (5.105) in [22, Sect. 5.8.2]). As explained in [22, Sect. 5.8.4], it has an error-correcting algorithm, which corrects k errors. The manager can apply it and recover \(w_\ell\) in line 6 of Algorithm 1.
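The vanishing condition (18) can be checked numerically: for a valid vector of shares the 2k weighted sums are all zero, while a single corrupted share makes them nonzero. The sketch below only detects an error and does not implement the BCH decoding algorithm of [22]. The field of integers modulo 29, k = 2, the element 16 of order 7, and the function names are assumptions made for illustration.

```python
import secrets

K = 2                    # largest number of compromised servers
M = 3 * K + 1            # number of servers, as in the proof
P = 29                   # toy prime field (assumption); M divides P - 1
ALPHA = 16               # element of order 7 modulo 29

def shares_of(secret):
    """Values f(alpha^0), ..., f(alpha^(M-1)) of a random degree-K polynomial with f(0) = secret."""
    coeffs = [secret % P] + [secrets.randbelow(P) for _ in range(K)]
    return [
        sum(c * pow(ALPHA, i * j, P) for j, c in enumerate(coeffs)) % P
        for i in range(M)
    ]

def syndromes(shares):
    """The sums in (18) for r = 1, ..., 2K; shares[i] is the share at alpha^i."""
    return [
        sum(pow(ALPHA, r * i, P) * s for i, s in enumerate(shares)) % P
        for r in range(1, 2 * K + 1)
    ]

clean = shares_of(12)
corrupted = list(clean)
corrupted[3] = (corrupted[3] + 5) % P     # one compromised server alters its share

clean_syndromes = syndromes(clean)
corrupted_syndromes = syndromes(corrupted)
```

A manager could use such a check to decide whether error correction is needed before interpolating; the correction itself is the task of the BCH decoder referred to in the text.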

Likewise, in line 9 of Algorithm 1, the manager \({\mathscr {M}}_m\) can use the BCH error-correction algorithm to correct all possible errors and recover \(w'_{\ell }\).

After all iterations of the loop in line 3 of Algorithm 1, the managers recover \(w_L\) and \(w'_L\). Then every manager can locally compute the correct value \(\psi (w_L, w'_{L}) = \varphi (B({\mathscr {D}}))\).

Next, we prove that during steps of Algorithm 1 the active outsider attackers cannot derive confidential information concerning the data of separate managers by combining the values available to them from the corresponding compromised servers. It was proved in [19] that, for \(k \le \lceil M/3 \rceil - 1\), if a secret value is considered as a uniformly distributed random variable over \({\mathscr {F}}\), then the values (3) are k-wise independent random variables that are uniformly distributed over \({\mathscr {F}}\), and therefore a set of k shares gained by the active outsider attackers from the k compromised servers cannot help to discover any confidential information.

In Algorithm 1, formula (3) is applied in lines 4 and 7 to communicate three secret values as sets of secret shares sent to the servers.

Therefore, it is impossible for the active outsider attackers to derive any confidential information from the values available to them as private shares from at most k compromised servers.

In line 4, each server \(S_{m'}\) receives \(z_{m, m'}\), which is calculated using (3). It follows that the values received by the compromised servers are k-wise independent random variables uniformly distributed over \({\mathscr {F}}\). Hence the active outsider attackers cannot use these values to derive confidential information.

In line 7, each server \(S_{m'}\) receives \(z'_{m, m'}\), which is calculated using (3). Therefore, the values received by the compromised servers are k-wise independent random variables uniformly distributed over \({\mathscr {F}}\). Thus, it is impossible for the active outsider attackers to deduce confidential information using these values.

Finally, in line 8 of Algorithm 1 the servers use the procedure described in Sect. 2.3 for computing \(t'_{\ell , m} = \Delta _m(z'_{1, m}, \dots , z'_{M, m})\), for \(m \in [1: M]\). This procedure involves communicating data between the servers. The compromised servers receive new intermediate values as secret shares. However, the procedure uses randomization polynomials as explained in Sect. 2.3. It follows that the intermediate secret shares transferred to the compromised servers during this procedure are also k-wise independent random variables uniformly distributed over \({\mathscr {F}}\). Again, it is impossible for the active outsider attackers to deduce confidential information from the intermediate secret shares transferred to the compromised servers during the computation of the \(t'_{\ell , m}\).

This proves that in Algorithm 1 the active outsider attackers cannot derive confidential information from the values they get from the compromised servers.
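The independence property used in this argument can be checked exhaustively on a small example. The following minimal Python sketch assumes that the shares are values of a random polynomial of degree k whose constant term is the secret (standard Shamir sharing); the field size, the number of servers, and all function names are illustrative assumptions and are not taken from Algorithm 1.

```python
from collections import Counter
from itertools import product

P = 11   # small prime field standing in for F, so every case can be enumerated
M = 7    # number of servers (assumed for illustration)
K = 2    # number of compromised servers; satisfies K <= ceil(M/3) - 1

def shares(secret, coeffs):
    """Shares f(1), ..., f(M) of f(x) = secret + c1*x + ... + cK*x^K over GF(P)."""
    poly = (secret,) + tuple(coeffs)
    return [sum(c * pow(x, j, P) for j, c in enumerate(poly)) % P
            for x in range(1, M + 1)]

def attacker_view_distribution(secret, compromised):
    """Count every tuple of compromised shares over all choices of random coefficients."""
    views = Counter()
    for coeffs in product(range(P), repeat=K):
        s = shares(secret, coeffs)
        views[tuple(s[i - 1] for i in compromised)] += 1
    return views

def reconstruct(points):
    """Lagrange interpolation at 0 from K + 1 points (x, f(x))."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total
```

For every secret, `attacker_view_distribution(secret, (2, 5))` returns the same uniform distribution over all pairs of field elements, so the view of K compromised servers carries no information about the secret, whereas `reconstruct` recovers the secret from any K + 1 shares.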

Second, suppose that the active outsider attackers compromised \(k \le \lceil M/2 \rceil - 1\) of the servers \(S_1, \dots , S_M\). It remains to prove that the PAA-HE protocol produces correct answers to distributed queries from the class \({\mathscr {C}}\), and that the active outsider attackers cannot derive confidential information of individual managers by combining the data available to them from the compromised servers.

In Algorithm 2, the servers \(S_1, \dots , S_M\) perform identical computations. Evidently, \(\lceil M/2 \rceil - 1\) is the largest integer that is strictly less than M/2. Since \(k \le \lceil M/2 \rceil - 1\), we get \(k < M/2\). It follows that only a minority of the servers \(S_1, \dots , S_M\) are compromised, and so the managers can obtain correct results by taking the value returned by the majority of the servers, all of which are uncompromised.
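The majority argument can be sketched as follows. This is a minimal illustration under the assumption that all uncompromised servers return identical values; the function names are hypothetical and do not appear in Algorithm 2.

```python
from collections import Counter

def max_compromised(M):
    """Largest tolerated number of compromised servers, ceil(M/2) - 1, in integer arithmetic."""
    return (M + 1) // 2 - 1

def majority_result(results):
    """Return the value reported by a strict majority of the servers, or raise an error."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise ValueError("no strict majority among the server results")
    return value
```

For example, with \(M = 5\) the bound gives \(k \le 2\), and `majority_result([42, 7, 42, 13, 42])` returns the value 42 reported by the three uncompromised servers.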

The servers \(S_1, \dots , S_M\) receive only encrypted values in Algorithm 2. They perform all calculations in encrypted form using the homomorphic properties of the ElGamal and Paillier cryptosystems. Since these cryptosystems are known to be semantically secure under standard hardness assumptions (the decisional Diffie-Hellman assumption for ElGamal and the decisional composite residuosity assumption for Paillier), it follows that the active outsider attackers cannot derive any confidential information from the encrypted values available to them from the compromised servers. This completes the proof. \(\square\)
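The homomorphic properties invoked in the last step of the proof can be illustrated with a toy Python sketch. It uses textbook Paillier (with the generator \(n + 1\)) and textbook ElGamal; all parameters below are hypothetical, far too small for real use, and are not the parameters of the PAA-HE protocol.

```python
import math
import random

# Toy Paillier key from two small primes (illustrative assumption only).
PA, QA = 1789, 1861
N = PA * QA
N2 = N * N
LAM = (PA - 1) * (QA - 1) // math.gcd(PA - 1, QA - 1)   # lcm(p - 1, q - 1)
MU = pow(LAM, -1, N)                                     # valid for generator g = N + 1

def paillier_encrypt(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (1 + m * N) * pow(r, N, N2) % N2              # g^m * r^N mod N^2

def paillier_decrypt(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

# Toy ElGamal key over the Mersenne prime 2^31 - 1 (illustrative assumption only).
EG_P = 2_147_483_647
EG_G = 7
EG_X = 123_456_789                                       # hypothetical private key
EG_H = pow(EG_G, EG_X, EG_P)

def elgamal_encrypt(m):
    r = random.randrange(1, EG_P - 1)
    return pow(EG_G, r, EG_P), m * pow(EG_H, r, EG_P) % EG_P

def elgamal_decrypt(c):
    c1, c2 = c
    return c2 * pow(c1, EG_P - 1 - EG_X, EG_P) % EG_P    # c2 * c1^(-x) mod p

def paillier_add(c1, c2):
    """Multiplying Paillier ciphertexts adds the plaintexts modulo N."""
    return c1 * c2 % N2

def elgamal_mul(c, d):
    """Multiplying ElGamal ciphertexts componentwise multiplies the plaintexts modulo p."""
    return c[0] * d[0] % EG_P, c[1] * d[1] % EG_P
```

With these definitions, `paillier_decrypt(paillier_add(paillier_encrypt(1200), paillier_encrypt(34)))` yields 1234 and `elgamal_decrypt(elgamal_mul(elgamal_encrypt(6), elgamal_encrypt(7)))` yields 42, although a server performing `paillier_add` or `elgamal_mul` never sees a plaintext.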

4 Experimental setup and outcomes

This section is devoted to experiments using real datasets from the UCI Machine Learning Repository [23], the parameters of which are summarized in Table 3. To investigate the performance of our protocols for larger collections of data, we generated synthetic sets with the numbers of vectors ranging up to \(10^9\).

Table 3 Datasets from the UCI Machine Learning Repository used in our experiments

Our experiments investigate the effectiveness of the PAA-SSS and PAA-HE protocols by comparing them with the EGP protocol.

Fig. 1 The computation time (seconds) of PAA-HE, PAA-SSS, and EGP protocols with \(M=5\) and \(M=10\), for the data described in Table 3

Fig. 2 The computation time (seconds) of PAA-HE, PAA-SSS, and EGP protocols with \(M=5\) and \(M=10\), for synthetic data

Fig. 3 The data (MB) communicated by PAA-HE, PAA-SSS, and EGP protocols with \(M=5\) and \(M=10\), for the data described in Table 3

Fig. 4 The data (MB) communicated by PAA-HE, PAA-SSS, and EGP protocols with \(M=5\) and \(M=10\), for synthetic data

The PAA-SSS and PAA-HE protocols are the first privacy-preserving protocols for computing distributed numerical queries over large distributed collections of data that provide protection against active outsider attackers while minimizing the communication and computation costs for big data. Other protocols previously considered in the literature cannot protect against active outsider attackers in the situation considered in the present paper, and so they cannot be included in our experiments.

The experiments dealt with computing the geometric mean, since it belongs to the difference \({\mathscr {C}} \setminus {\mathscr {K}}\) of the two important classes, as explained in Sect. 3, and since it is an interesting statistic that has not previously been considered in experimental studies in this research direction.
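To make the target statistic concrete, the following plaintext Python sketch shows one way in which the geometric mean of a horizontally distributed dataset can be assembled from per-manager aggregates, namely the sum of logarithms and the record count. It is an unprotected illustration under the assumption of positive values, not the PAA-SSS or PAA-HE protocol, and the function names are hypothetical.

```python
import math

def local_aggregate(values):
    """Aggregate one manager could compute locally: (sum of logarithms, number of records)."""
    return sum(math.log(v) for v in values), len(values)

def geometric_mean_from_parts(parts):
    """Combine the local aggregates of all managers into the global geometric mean."""
    total_log = sum(s for s, _ in parts)
    total_count = sum(n for _, n in parts)
    return math.exp(total_log / total_count)
```

For two subsets `[1, 2, 4]` and `[8, 16]`, the combined result agrees, up to floating-point rounding, with the fifth root of their product 1024, that is, with 4.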

In our experiments, we used two values of the number M of the dataset managers: \(M = 5\) and \(M = 10\). Accordingly, for each of these values, every dataset was divided into M separate subsets of approximately equal size. A synthetic dataset with the number of vectors up to \(10^9\) and with normally distributed random values of confidential features was generated.
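The data preparation described above can be sketched in Python as follows. The mean, standard deviation, dimension, seed, and function names are assumptions made for illustration, since the text does not specify them, and the sketch uses a much smaller number of vectors than \(10^9\).

```python
import random

def synthetic_vectors(count, dim, seed=0):
    """Generate `count` vectors with normally distributed features (mean 0, std 1 assumed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

def split_among_managers(records, M):
    """Horizontal partition into M consecutive subsets whose sizes differ by at most one."""
    base, extra = divmod(len(records), M)
    subsets, start = [], 0
    for m in range(M):
        size = base + (1 if m < extra else 0)
        subsets.append(records[start:start + size])
        start += size
    return subsets
```

Calling `split_among_managers(synthetic_vectors(103, 4), 10)` gives ten subsets of sizes 11 or 10 whose concatenation is the original list, so each record belongs to exactly one manager.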

The communication time is proportional to the size of the data transferred during the execution of the protocol divided by the data transfer speed of the network. This is why, for comparing the communication costs of the protocols, our diagrams show the total size of the data transferred in the experiments.

In this paper, our diagrams with the total communication costs contain only the volume of data communicated between the separate managers. The process of obtaining their original vectors from the corresponding local datasets is not a part of communication.

Since the communication time is determined by the size of data that has to be transferred in steps of the protocols and the bandwidth or speed of transfer of data over the Internet, we compare only the combined amount of data to be transferred to and from the servers \(S_1, \dots , S_M\).

The comparison of the performance of the algorithms is presented in Figs. 1, 2, 3, and 4. These outcomes show that both PAA-SSS and PAA-HE are much more efficient than the EGP protocol, and that PAA-HE outperforms both of the other protocols.

5 Conclusion

The development of privacy-enhancing techniques and protocols for data aggregation and analytics in wireless networks requires novel methods for efficient and privacy-preserving computation of distributed queries with the protection of outcomes from active attackers.

In this paper, we propose two protocols for the protection of confidential data from active outsider attackers in this situation: PAA-SSS and PAA-HE. The analysis and experimental outcomes demonstrate that PAA-SSS and PAA-HE are more efficient than alternative options.