Oblivious Sampling with Applications to Two-Party k-Means Clustering

Bunn, Paul; Ostrovsky, Rafail

doi:10.1007/s00145-020-09349-w

Published: 12 May 2020

Volume 33, pages 1362–1403, (2020)

Journal of Cryptology

Paul Bunn¹ &
Rafail Ostrovsky²

Abstract

The k-means clustering problem is one of the most explored problems in data mining. With the advent of protocols that have proven to be successful in performing single database clustering, the focus has shifted in recent years to the question of how to extend the single database protocols to a multiple database setting. To date, there have been numerous attempts to create specific multiparty k-means clustering protocols that protect the privacy of each database, but according to the standard cryptographic definitions of “privacy-protection”, so far all such attempts have fallen short of providing adequate privacy. In this paper, we describe a Two-Party k-Means Clustering Protocol that guarantees privacy against an honest-but-curious adversary, and is more efficient than utilizing a general multiparty “compiler” to achieve the same task. In particular, a main contribution of our result is a way to efficiently compute multiple iterations of k-means clustering without revealing the intermediate values. To achieve this, we describe a technique for performing two-party division securely and also introduce a novel technique allowing two parties to securely sample uniformly at random from an unknown domain size. The resulting Division Protocol and Random Value Protocol are of use to any protocol that requires the secure computation of a quotient or random sampling. Our techniques can be realized based on the existence of any semantically secure homomorphic encryption scheme. For concreteness, we describe our protocol based on the Paillier homomorphic encryption scheme (see Paillier, in Advances in Cryptology, EUROCRYPT’99 Proceedings, LNCS 1592, pp. 223–238, 1999). We will also demonstrate that our protocol is efficient in terms of communication, remaining competitive with existing protocols (such as Jagannathan and Wright, in KDD’05, pp. 593–599, 2005) that fail to protect privacy.

Notes

Although designed to be a secure k-means clustering protocol, [18] falls short of full security due to leakage of intermediate results, e.g. for each iteration of the Lloyd Step, the number of data points in each cluster is revealed.
Implicit in running the same protocol with the roles reversed is reliance upon a Change Modulus Protocol, which will allow the parties to translate their shares of Q (mod N) to shares of Q (mod $N^A$) or (mod $N^B$), where $N^A$ (resp. $N^B$) is the public-key modulus of the underlying encryption scheme of the subprotocols that are used, in which Alice (resp. Bob) knows the private key.
Since data is split between Alice and Bob, the exact value for $\mathtt {M}$ is not known by either party. There are various options for how this value can be computed exactly or estimated, and choice of which approach is most appropriate will depend on the nature of the data (e.g. are there natural known domains/bounds for each attribute? Has data been pre-normalized, or is normalization required anyway to ensure some attributes do not dominate the clustering protocol?) as well as privacy considerations (e.g. are players allowed to know the domains (or bounds) of each individual attribute?). If approximation of $\mathtt {M}$ is not possible, the two parties can engage in a secure multiparty computation protocol to compute $\mathtt {M}$, e.g. by applying Yao to the circuit that computes $\mathtt {M}^2$. Since such a circuit has size O(nd), this cost can be absorbed by the $O(\lambda nd)$ cost of communicating the nd encryptions that occurs at the outset of our protocol.

References

D. Agrawal, C. Aggarwal, On the design and quantification of privacy preserving data mining algorithms, in Proc. of the 20th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems (2001), pp. 247–255
R. Agrawal, R. Srikant, Privacy-preserving data mining, in Proc. of the 2000 ACM SIGMOD Int. Conf. on Management of Data (2000), pp. 439–450
J. Algesheimer, J. Camenisch, V. Shoup, Efficient computation modulo a shared secret with application to the generation of shared safe-prime products, in CRYPTO’02, LNCS 2442 (2002), pp. 417–432
A. Blum, C. Dwork, F. McSherry, K. Nissim, Practical privacy: the SuLQ framework, in 24th Symposium on Principles of Database Systems (2005), pp. 128–138
P. Bradley, U. Fayyad, Refining initial points for $K$-means clustering, in Proc. of the 15th International Conference on Machine Learning (1998), pp. 91–99
R. Canetti, Security and composition of multiparty cryptographic protocols. J. Cryptol. 13(1), 143–202 (2000)
O. Catrina, A. Saxena, Secure computation with fixed-point numbers, in 14th Financial Cryptography and Data Security (2010), pp. 35–50
M. Ciampi, R. Ostrovsky, L. Siniscalchi, I. Visconti, Delayed-input non-malleable zero knowledge and multi-party coin tossing in four rounds, in 15th Theory of Cryptography (TCC) (2017), pp. 711–742
M. Dahl, C. Ning, T. Toft, On secure two-party integer division, in 16th Financial Cryptography and Data Security (2012), pp. 164–178
C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in Proc. of the 3rd Theory of Cryptography Conference (2006), pp. 265–284
I. Dinur, K. Nissim, Revealing information while preserving privacy, in Proc. of the 22nd ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems (2003), pp. 202–210
C. Dwork, K. Nissim, Privacy-preserving datamining on vertically partitioned databases, in CRYPTO’04, LNCS 3152 (2004), pp. 528–544
S. From, T. Jakobsen, Secure multi-party computation on integers. Master’s thesis, Univ. of Aarhus, Denmark, BRICS, Dep. of Computer Science (2006)
B. Goethals, S. Laur, H. Lipmaa, T. Mielikäinen, On private scalar product computation for privacy-preserving data mining, in ICISC, LNCS 3506 (2004), pp. 104–120
O. Goldreich, The Foundations of Cryptography, Basic Applications (Cambridge University Press, Cambridge, 2004)
J. Guajardo, B. Mennink, B. Schoenmakers, Modulo reduction for Paillier encryptions and application to secure statistical analysis, in 14th Financial Cryptography and Data Security (2010), pp. 375–382
Y. Ishai, E. Kushilevitz, R. Ostrovsky, A. Sahai, Zero-knowledge from secure multiparty computation, in ACM Symposium on Theory of Computing (2007)
G. Jagannathan, R. Wright, Privacy-preserving distributed $k$-means clustering over arbitrarily partitioned data, in KDD’05 (2005), pp. 593–599
S. Jha, L. Kruger, P. McDaniel, Privacy Preserving Clustering, in 10th European Symp. on Research in Computer Security (2005), pp. 397–417
E. Kiltz, G. Leander, J. Malone-Lee, Secure computation of the mean and related statistics, in TCC’05, LNCS 3378 (2005), pp. 283–302
Y. Lindell, B. Pinkas, Privacy preserving data mining, in CRYPTO’00, LNCS 1880 (2000), pp. 36–54
M. Naor, B. Pinkas, Oblivious polynomial evaluation. SIAM J. Comput. 35(5), 1254–1281 (2006)
S. Oliveira, O.R. Zaïane, Privacy preserving clustering by data transformation, in Proc. 18th Brazilian Symposium on Databases (2003), pp. 304–318
R. Ostrovsky, Y. Rabani, L. Schulman, C. Swamy, The effectiveness of Lloyd-type methods for the k-means problem, in FOCS (2006)
P. Paillier, Public key cryptosystems based on composite degree residuosity classes, in Advances in Cryptology, EUROCRYPT’99 Proceedings, LNCS 1592 (1999), pp. 223–238
J. Reif, S. Tate, Optimal size integer division circuits. SIAM J. Comput., 912–924 (1990)
C. Su, F. Bao, J. Zhou, T. Takagi, K. Sakurai, Privacy-preserving two-party $K$-means clustering via secure approximation, in 21st Inter. Conf. on Advanced Information Networking and Applications Workshops (2007), pp. 385–391
J. Vaidya, C. Clifton, Privacy-preserving $k$-means clustering over vertically partitioned data, in Proc. 9th ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining (2003), pp. 206–215
T. Veugen, Encrypted integer division and secure comparison. Int. J. Appl. Cryptogr. 3(2), 166–180 (2014)
R. Wright, Z. Yang, Privacy-preserving Bayesian network structure computation on distributed heterogeneous data, in Proc. of the 10th ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining (2004), pp. 713–718
A.C.C. Yao, How to generate and exchange secrets, in Proc. of the 27th IEEE Symp. on Foundations of Computer Science (1986), pp. 162–167
H. Zhu, F. Bao, Oblivious scalar-product protocols, in 11th Australasian Conference on Information Security and Privacy, LNCS 4058 (2006), pp. 313–323

Author information

Authors and Affiliations

Stealth Software Technologies, Inc., Los Angeles, USA
Paul Bunn
Computer Science Department and Department of Mathematics, University of California, Los Angeles, CA, 90095, USA
Rafail Ostrovsky

Authors

Paul Bunn
Rafail Ostrovsky

Corresponding author

Correspondence to Paul Bunn.

Additional information

Communicated by Damgard.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this paper appeared in the Proceedings of the 14th ACM Conference on Computer and Communications Security, pp. 486–497, 2007. Paul Bunn: Research partially done while at UCLA and visiting IPAM, and supported in part by NSF VIGRE Grant DMS-0502315, NSF Cybertrust Grant No. 0430254, and by DARPA and SPAWAR under Contract N66001-15-C-4065. Rafail Ostrovsky: Partially done while visiting IPAM. Author supported in part by NSF Grant 1619348, BSF Grant 2015782, DARPA SafeWare subcontract to Galois Inc., DARPA SPAWAR Contract N66001-15-1C-4065, JP Morgan Faculty Research Award, OKAWA Foundation Research Award, IBM Faculty Research Award, Xerox Faculty Research Award, B. John Garrick Foundation Award, Teradata Research Award, and Lockheed-Martin Corporation Research Award. The views expressed are those of the authors and do not reflect position of the Department of Defense or the U.S. Government.

Appendices

A Alternative Stopping Criterion for Lloyd Step

It is possible that the iterative nature of the Lloyd Step may reveal undesirable information to the two parties: the number of iterations that are performed in the Lloyd Step. We suggest three different approaches to handle this privacy concern:

Approach 1: Reveal the Number of Iterations. If Alice and Bob agree beforehand that this leak of information will not compromise the privacy of their data, they can choose to run our algorithm (as is) and reveal the number of iterations.
Approach 2: Set the Number of Iterations to be Proportional to n. In general, the more data points, the more iterations are necessary to reach the stopping condition. Based on n, one could therefore approximate the expected number of iterations that should be necessary, and fix our protocol to perform this many iterations.
Approach 3: Fix the Number of Iterations to be Constant. In [24], it is argued that if the data points enjoy certain “nice” properties, then the number of iterations is extremely small (i.e. with high probability, only 2 iterations are necessary). Thus, fixing the number of iterations to be some (small) constant will (with high probability) not result in a premature termination of the Lloyd Step (i.e. the stopping condition will likely have been reached).

Each approach has its pros and cons. Approach 1 guarantees the accuracy of the final output (as the stopping criterion has been met) in the minimal number of steps, but leaks information about how many iterations were performed. Approach 2 succeeds with high probability, but may unnecessarily affect communication complexity if the fixed number of iterations is higher than necessary. Approach 3 keeps communication minimal, but runs a higher risk of losing accuracy of the final output (i.e. if the stopping criterion hasn’t been reached after the fixed number of iterations have been completed). In the body of our paper, we assumed Approach 1, although it is trivial to modify our algorithm to implement instead Approach 2 or 3.
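
The difference between the approaches can be illustrated on a plain, non-private Lloyd iteration. The following is only a sketch under our own assumptions: the function name `lloyd_1d`, the one-dimensional data, and the stopping condition "the centers did not change" are illustrative choices and are not taken from the protocol in the body of the paper.

```python
def lloyd_1d(points, centers, fixed_iters=None, max_iters=1000):
    """Plain (non-private) Lloyd iterations on 1-D data.

    fixed_iters=None : Approach 1, stop once the centers no longer change;
                       the returned iteration count is the value that leaks.
    fixed_iters=t    : Approaches 2 and 3, always run exactly t iterations.
    """
    centers = list(centers)
    iterations = 0
    limit = fixed_iters if fixed_iters is not None else max_iters
    while iterations < limit:
        # Assign every point to its nearest current center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep empty ones).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        iterations += 1
        converged = new_centers == centers
        centers = new_centers
        if fixed_iters is None and converged:
            break
    return centers, iterations
```

With a fixed iteration count, the second return value is a public constant and carries no information about the data, at the price of possibly running too few or too many iterations.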

B Reordering Protocol

As mentioned in Sect. 4.1, this protocol can be thought of as selecting balls from a bag, where each ball is marked with an index $i \in [0..\lambda ]$. In particular, the bag will initially contain $2^i$ balls marked with index i for each $i \in [0..\lambda ]$. Then reordering $[0..\lambda ]$ is achieved by selecting a ball from the bag at random, and outputting the corresponding index as the first number in the reordered sequence. Next, all balls with that index are removed from the bag, and the procedure is repeated to generate the second number in the sequence, and so on until the bag is empty. We give below a formal treatment of this procedure.

Example Reordering Protocol Let $\lambda $ be an arbitrary positive integer. This protocol will reorder the digits $[0..\lambda ]$, or more formally, it will generate a permutation:

$$\begin{aligned} \tau : [0..\lambda ] \rightarrow [0..\lambda ] \end{aligned}$$

It will be more convenient (both in the protocol description as well as proofs) to describe $\tau $ by its inverse permutation $\sigma := \tau ^{-1}$. The following procedure generates the permutation $\sigma $ one element at a time.

1. For $0 \le i \le \lambda $:
  (a) If $i = 0$, define $U := 2^{\lambda + 1} - 1$. Otherwise, update U by subtracting $2^{\sigma (i-1)}$. Equivalently, if $U = u_{\lambda } \dots u_1 u_0$ denotes the binary representation of U, then U is initialized (for $i=0$) so that all binary digits are ‘1’, and then for each subsequent iteration, U is updated by flipping the $\sigma (i-1)$ binary digit from ‘1’ to ‘0’.
  (b) For each $0 \le j \le \lambda $ with $u_j = 1$, define values $\{a_j\}$ as the value represented by the j lowest order binary digits of U:
  $$\begin{aligned} a_j = u_{j-1} \dots u_1 u_0 \end{aligned}$$
  These values are used to partition the interval [1..U] into $1 + \lambda -i$ intervals based on the binary representation of U. Namely, for each non-zero binary digit $0 \le j \le \lambda $ of U, the corresponding interval is from $[a_j + 1..a_j + 2^j]$.
  (c) Choose a number $r \leftarrow [1..U]$ uniformly at random, and set $\sigma (i)$ equal to the interval index that r falls in. Formally, set $\sigma (i) = j$ if $r \in [a_j + 1..a_j + 2^j]$.
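
The procedure above can be sketched in the clear as follows. This is a plaintext illustration only: a single party holds U and r, whereas in the secure setting these values would be shared. The function name `example_reordering` and the `rng` parameter are our own illustrative choices.

```python
import random

def example_reordering(lam, rng=random):
    """Return sigma as a list, where sigma[i] is the index output in iteration i."""
    U = 2 ** (lam + 1) - 1                 # step (a), i = 0: all lam+1 digits are '1'
    sigma = []
    for i in range(lam + 1):
        if i > 0:
            U -= 2 ** sigma[i - 1]         # step (a): flip digit sigma(i-1) to '0'
        r = rng.randint(1, U)              # step (c): uniform in [1..U], both ends included
        for j in range(lam + 1):
            if (U >> j) & 1:               # step (b): only digits with u_j = 1
                a_j = U & ((1 << j) - 1)   # value of the j lowest-order digits of U
                if a_j + 1 <= r <= a_j + 2 ** j:
                    sigma.append(j)
                    break
    return sigma
```

Each call returns a permutation of $[0..\lambda ]$ in which larger indices tend to appear earlier, since index j owns an interval of length $2^j$.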

Claim The Example Reordering Protocol satisfies the Reordering Property in Definition 4.

Proof Sketch Viewing the Example Reordering Protocol as a formalization of the “selecting balls from a bag” description (we leave the reader to verify the formalization matches the intuition), we compute the probability that an index j appears first among an arbitrary set of indices ${\mathcal {I}}$. Let ${\mathcal {I}} \subseteq [0..\lambda ]$ and let $j \in {\mathcal {I}}$ be arbitrary. We utilize the Law of Total Probability to write:

$$\begin{aligned}&\text{ Probability } \text{ that }\ j\ \text{ appears } \text{ first } \text{ among } \text{ all } \text{ indices } \text{ in } \ {\mathcal {I}}\nonumber \\&\quad = \sum _{i = 0}^{1 + \lambda - |{\mathcal {I}}|} \left[ (\text{ Probability }\ j\ \text{ is } \text{ output } \text{ in } \text{ iteration }\ i \ | \ i\ \text{ is } \text{ the }\ \textit{first}\ \text{ time }\ \textit{some}\ \text{ index } \text{ in } \ {\mathcal {I}}\ \text{ is } \text{ output })\right. \nonumber \\&\qquad \left. \cdot (\text{ Probability }\ i\ \text{ is } \text{ the }\ \textit{first}\ \text{ time } \ \textit{some}\ \text{ index } \text{ in }\ {\mathcal {I}}\ \text{ is } \text{ output })\right] \end{aligned}$$

(20)

where the above sum stops at $i = 1 + \lambda - |{\mathcal {I}}|$ because if we reach this iteration without having output any indices in ${\mathcal {I}}$, then all that is left in the bag at iteration $i = 1 + \lambda - |{\mathcal {I}}|$ are balls with an index in ${\mathcal {I}}$. Notice that at any iteration i, the first probability on the RHS of (20) is independent of i, namely it is $2^j/\sum _{k\in \ {\mathcal {I}}} 2^k$, since we are conditioning on a ball from ${\mathcal {I}}$ being selected, and there are $2^k$ balls in the bag for every index k. Since this quantity is independent of i, it can be removed from the sum:

$$\begin{aligned}&\text{ Probability } \text{ that }\ j\ \text{ appears } \text{ first } \text{ among } \text{ all } \text{ indices } \text{ in } \ {\mathcal {I}} \nonumber \\&\quad =\left( 2^j/\sum _{k\in \ {\mathcal {I}}} 2^k \right) \ \cdot \ \sum _{i = 0}^{1+ \lambda - |{\mathcal {I}}|} \ (\text{ Probability }\ i \ \text{ is } \text{ the }\ \textit{first}\ \text{ time }\ \textit{some} \ \text{ index } \text{ in }\ {\mathcal {I}}\ \text{ is } \text{ output }) \nonumber \\&\quad =2^j/\sum _{k\in \ {\mathcal {I}}} 2^k, \end{aligned}$$

(21)

where the last equality is due to the fact that we are summing over the complete probability space, i.e. the sum of probabilities that a ball from ${\mathcal {I}}$ is first selected in iteration i, as i ranges over $[0..(1 + \lambda - |{\mathcal {I}}|)]$, equals one. Notice that (21) matches the Reordering Property (10), as desired. $\square $
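
For small $\lambda $, the identity (21) can also be checked exactly by enumerating every possible output order of the “selecting balls from a bag” procedure. The sketch below does this with rational arithmetic; the helper names (`order_distribution`, `prob_first`, `check_reordering_property`) are hypothetical and serve only this illustration.

```python
from fractions import Fraction
from itertools import combinations

def order_distribution(lam):
    """Exact probability of every output order of the ball-drawing procedure."""
    dist = {}

    def draw(remaining, prefix, prob):
        if not remaining:
            dist[prefix] = prob
            return
        total = sum(2 ** k for k in remaining)      # balls still in the bag
        for j in remaining:
            # Index j is drawn next with probability 2^j / total.
            draw(remaining - {j}, prefix + (j,), prob * Fraction(2 ** j, total))

    draw(frozenset(range(lam + 1)), (), Fraction(1))
    return dist

def prob_first(dist, j, subset):
    """Probability that j is output before every other index in subset."""
    return sum(p for order, p in dist.items()
               if min(subset, key=order.index) == j)

def check_reordering_property(lam):
    """True iff (21) holds for every non-empty subset of [0..lam] and every j in it."""
    dist = order_distribution(lam)
    return all(
        prob_first(dist, j, subset) == Fraction(2 ** j, sum(2 ** k for k in subset))
        for size in range(1, lam + 2)
        for subset in combinations(range(lam + 1), size)
        for j in subset
    )
```

Such a check is no substitute for the proof sketch, but it confirms the closed form $2^j/\sum _{k\in {\mathcal {I}}} 2^k$ on concrete instances.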

C Implementations of Protocols from Sect. 2.2

We describe here possible implementations of each of the (non-referenced) protocols listed in Sect. 2.2. We provide these implementations solely for completeness, and make no claim concerning their efficiency in relation to other existing protocols that perform the same tasks. Since we need each of these protocols to be secure against an honest-but-curious adversary, we need the communication in each subprotocol to be in the generic form of Lemma 2.1 or to utilize other protocols that are already known to be secure; and indeed this will be the case in each of the following.

C.1 Description of the Find Minimum of 2 Numbers Protocol

Input Let $X = x_{\lambda } \dots x_1 x_0$ and $Y = y_{\lambda } \dots y_1 y_0$ denote the binary representations of two values $X, Y < N$. For each $0 \le i \le \lambda $, Alice and Bob share $x_i$ and $y_i$ (mod N).

Output Alice and Bob should share 0 if $X < Y$ or share 1 if $X > Y$. If $X=Y$, they will share either 0 or 1 depending on an agreed upon distribution, e.g. they can choose to always output ‘0’ in the case of equality, or always output ‘1’, or output ‘0’ according to some fixed probability r.

Cost Communication cost of this protocol is $O(\lambda ^2)$, where $\lambda = \lfloor \text{ log } N \rfloor $.

Protocol Description This protocol performs a standard minimum comparison on the binary representations of the two numbers. The following formula returns the location of the minimum of (X, Y): it returns 0 if $X < Y$, 1 if $X > Y$, and a value $r \in \{0,1\}$ if $X=Y$:

$$\begin{aligned} L&= (x_{\lambda } \oplus y_{\lambda })x_{\lambda } \nonumber \\&\quad +(x_{\lambda } \oplus y_{\lambda } \oplus 1)(x_{\lambda - 1} \oplus y_{\lambda - 1})x_{\lambda - 1} \nonumber \\&\quad +(x_{\lambda } \oplus y_{\lambda } \oplus 1)(x_{\lambda - 1} \oplus y_{\lambda - 1} \oplus 1)(x_{\lambda - 2} \oplus y_{\lambda - 2})x_{\lambda - 2} \nonumber \\&\quad +\dots \nonumber \\&\quad +(x_{\lambda } \oplus y_{\lambda } \oplus 1) \dots (x_1 \oplus y_1 \oplus 1)(x_0 \oplus y_0)x_0 \nonumber \\&\quad +(x_{\lambda } \oplus y_{\lambda } \oplus 1) \dots (x_1 \oplus y_1 \oplus 1)(x_0 \oplus y_0 \oplus 1)\cdot r \end{aligned}$$

(22)

where $\oplus $ signifies XOR, and the other operations are performed in ${\mathbb {Z}}_N$. Shares of the output can then be obtained by running the SPP many times, utilizing the fact that:

$$\begin{aligned} x \oplus y = x + y - 2xy, \end{aligned}$$

(23)

where addition on the left hand side is in ${\mathbb {Z}}_2$ and on the right hand side is in ${\mathbb {Z}}_N$.
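
To make formula (22) concrete, the following sketch evaluates it in the clear, using only additions and multiplications in ${\mathbb {Z}}_N$ together with identity (23). In the protocol each multiplication would be carried out on shares via the SPP; here all bits are known to a single party, and the function names are our own.

```python
def xor_bit(x, y, N):
    """XOR of two bits expressed with arithmetic in Z_N, as in (23)."""
    return (x + y - 2 * x * y) % N

def min_location(x_bits, y_bits, N, r=0):
    """Evaluate (22). Bit lists are given most significant digit first."""
    L = 0
    equal_so_far = 1                  # running product of the (x_i XOR y_i XOR 1) factors
    for x, y in zip(x_bits, y_bits):
        differ = xor_bit(x, y, N)
        # This term is 1 only at the first differing digit, and only if x_i = 1 there.
        L = (L + equal_so_far * differ * x) % N
        equal_so_far = (equal_so_far * xor_bit(differ, 1, N)) % N
    return (L + equal_so_far * r) % N  # last term of (22): all digits equal

def bits_msb_first(value, width):
    return [(value >> i) & 1 for i in reversed(range(width))]
```

For example, `min_location(bits_msb_first(5, 4), bits_msb_first(9, 4), 35)` evaluates to 0, since 5 < 9 and the minimum is therefore in the first position.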

C.2 Description of the To Binary Protocol

Input Alice and Bob share $X = X^A + X^B (\hbox {mod}\ N)$.

Output If $X = x_{\lambda } \dots x_1 x_0$ is the binary representation for X, then Alice and Bob share each binary digit $x_i = x^A_i + x^B_i$ (mod N).

Cost Communication cost of this protocol is $O(\lambda ^2)$, where $\lambda = \lfloor \text{ log } N \rfloor $.

Protocol Description Notice that there are two possibilities for how to compute X from the shares $X^A$ and $X^B$ (with arithmetic in ${\mathbb {Z}}$):

$$\begin{aligned} X = {\left\{ \begin{array}{ll} X^A + X^B, &{} \text{ if } X^A + X^B < N \\ X^A + X^B - N, &{} \text{ if } X^A + X^B \ge N \end{array}\right. } \end{aligned}$$

This protocol will find (shares of) the binary representation of both $X^A + X^B$ and $X^A + X^B - N$, and then invoke the FM2NP (combined with SPP) to select the proper case. Details are as follows: first, Alice and Bob will obtain (shares of) the binary representation of $X^A + X^B$. In particular, if $X^A := a_\lambda \dots a_1 a_0$, $X^B := b_\lambda \dots b_1 b_0$, then the following formula generates the binary representation $X^A + X^B = x_\lambda \dots x_1 x_0$:

$$\begin{aligned}&a_\lambda \dots a_1 a_0 \nonumber \\&\underline{\oplus \quad b_\lambda \dots b_1 b_0} \nonumber \\&x_\lambda \dots x_1 x_0, \end{aligned}$$

(24)

where the $\oplus $ symbol above means standard addition in ${\mathbb {Z}}_2$ (i.e. performed base 2, with carry-over). The computation in (24) can be done following the standard (insecure) manner of computation: start on the right and add the bits via XOR, keeping track of carry-over. This can be readily extended to a secure protocol by invoking SPP.

Next, shares of the binary representation of $2^{\lambda + 1} - N + X^A + X^B$ can be computed similarly, since, e.g. if $2^{\lambda + 1} - N = d_\lambda \dots d_1 d_0$ is the binary representation of $2^{\lambda + 1} - N$ (which is publicly known by both parties, since N and $\lambda $ are public), then the binary representation of $2^{\lambda + 1} - N + X^A + X^B = y_{\lambda + 1} y_\lambda \dots y_1 y_0$ can be computed via:

$$\begin{aligned}&a_\lambda \dots a_1 a_0 \nonumber \\&b_\lambda \dots b_1 b_0 \nonumber \\&\underline{\oplus \quad d_{\lambda + 1} d_\lambda \dots d_1 d_0} \nonumber \\&y_{\lambda + 1} y_\lambda \dots y_1 y_0 \end{aligned}$$

(25)

After computing (shares of) the binary representation for $2^{\lambda + 1} - N + X^A + X^B$, one of the parties will subtract ‘1’ from their share of the leading bit $y_{\lambda + 1}$, so that the parties will share the leading bit $\widehat{y}_{\lambda + 1} =y_{\lambda + 1} - 1$. Notice that in the case that $X^A + X^B \ge N$, this will result in the two parties sharing (the binary representation of) $X^A + X^B - N = \widehat{y}_{\lambda + 1} y_\lambda \dots y_1 y_0$.

Finally, the two parties can run the FM2NP on, e.g. $(X^A, N-X^B)$ to determine if $X^A + X^B < N$ (the first input $X^A$ is the minimum if and only if $X^A + X^B < N$); notice we use the indicated inputs to FM2NP so that we can use the version where the parties share the binary representations of the inputs (see Section C.1); i.e. Alice knows the binary digits of $X^A$ (Bob can set his shares to 0) and Bob knows the binary digits of $N-X^B$ (Alice can set her shares to 0). Alice and Bob can then run SPP to compute their final shares:

$$\begin{aligned} c \cdot (x_\lambda \dots x_1 x_0) + (1-c) \cdot (\widehat{y}_{\lambda + 1} y_\lambda \dots y_1 y_0), \end{aligned}$$

(26)

where $c \in \{0, 1\}$ is ‘1’ iff $X^A + X^B < N$.
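
The case selection in (24)–(26) can be traced with a plaintext sketch. Everything below runs in the clear on the two additive shares; the comparison bit c is computed directly rather than by FM2NP, and the bit-level additions are ordinary ripple-carry additions rather than SPP calls. The function names are illustrative assumptions.

```python
def to_bits(value, width):
    """Little-endian bit list: index i holds the coefficient of 2**i."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_add(a_bits, b_bits):
    """Add two equal-length bit lists; the result has one extra carry bit."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out + [carry]

def to_binary(xa, xb, N):
    """Bits x_0..x_lam of X = xa + xb (mod N), for 0 <= xa, xb < N."""
    lam = N.bit_length() - 1                    # lam = floor(log2 N)
    width = lam + 1
    x = ripple_add(to_bits(xa, width), to_bits(xb, width))   # (24): X^A + X^B
    d = to_bits(2 ** (lam + 1) - N, width + 1)               # public constant
    y = ripple_add(x, d)                                     # (25): 2^(lam+1) - N + X^A + X^B
    c = 1 if xa < N - xb else 0                 # stand-in for FM2NP on (X^A, N - X^B)
    # Ignoring y[lam+1] corresponds to subtracting 1 from the leading bit when c == 0.
    return [c * x[i] + (1 - c) * y[i] for i in range(width)]  # (26), digit by digit
```

For $N = 35$, $X^A = 30$ and $X^B = 20$, the sum wraps around, c is 0, and the selected digits encode $50 - 35 = 15$.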

C.3 Description of the Find Minimum of 2 Numbers Protocol

This protocol is similar to the protocol described in Section C.1, except that here Alice and Bob share X and Y, as opposed to sharing each binary digit.

Input Alice and Bob share two values $X = X^A + X^B \ (\hbox {mod}\ N)$ and $Y = Y^A + Y^B \ (\hbox {mod}\ N)$.

Output Alice and Bob should share 0 if $X < Y$ or share 1 if $X > Y$. If $X=Y$, they will share either 0 or 1 depending on an agreed upon distribution, e.g. they can choose to always output ‘0’ in the case of equality, or always output ‘1’, or output ‘0’ according to some fixed probability r.

Cost Communication cost of this protocol is $O(\lambda ^2)$, where $\lambda = \lfloor \text{ log } N \rfloor $.

Protocol Description This protocol simply has Alice and Bob utilize the To Binary Protocol twice (once for X and once for Y), and then proceeds with the Find Minimum of 2 Numbers Protocol of Section C.1.

C.4 Description of the Nested Product Protocol

Input Alice and Bob share a set of values: $\{X_i = X^A_i + X^B_i \ (\hbox {mod}\ N)\}_{i=1}^m$.

Output For each $1 \le i \le m$, Alice and Bob share: $Y_i := \prod _{j=1}^i X_j$.

Cost Cost of $(m-1)$ calls to SPP applied to a two-term function ($O(m \cdot \lambda )$, where $\lambda = \log _2 N$).

Protocol Description Notice Alice and Bob already share $Y_1$, which equals $X_1$. We describe how (shares of) term $Y_{i+1}$ can be obtained from (shares of) $Y_i$. Namely, let $Y_i =Y^A_i + Y^B_i \ (\hbox {mod}\ N)$. Then Alice and Bob can compute (shares of) $Y_{i+1} = Y_i \cdot X_{i+1}$ using the SPP applied to the degree-two function $(Y^A_i + Y^B_i)\cdot (X^A_{i+1}+X^B_{i+1}) =(Y^A_i \cdot X^A_{i+1}) + (Y^A_i \cdot X^B_{i+1}) + (Y^B_i \cdot X^A_{i+1}) + (Y^B_i \cdot X^B_{i+1})$ (all arithmetic modulo N).
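
The recursion $Y_{i+1} = Y_i \cdot X_{i+1}$ on additive shares can be simulated as follows. The function `spp_product` is only a stand-in: it evaluates the four cross terms in the clear and re-shares the result, which is of course not secure. It is meant to show the arithmetic that the SPP would perform obliviously. All names are hypothetical.

```python
import random

def share(value, N, rng):
    """Split value into two additive shares modulo N."""
    a = rng.randrange(N)
    return a, (value - a) % N

def spp_product(y_shares, x_shares, N, rng):
    """Insecure stand-in for SPP applied to the degree-two function of C.4."""
    (ya, yb), (xa, xb) = y_shares, x_shares
    product = (ya * xa + ya * xb + yb * xa + yb * xb) % N
    return share(product, N, rng)

def nested_product(x_shares, N, rng):
    """Return shares of Y_1, ..., Y_m using m - 1 product calls."""
    ys = [x_shares[0]]                      # Y_1 = X_1 is already shared
    for xs in x_shares[1:]:
        ys.append(spp_product(ys[-1], xs, N, rng))
    return ys
```

Reconstructing each pair of output shares modulo N yields the prefix products $X_1, X_1 X_2, \dots $, and exactly $m-1$ product calls are made.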

C.5 Description of the Find Minimum of k Numbers Protocol

Input Alice and Bob share k values $\{X_i =X^A_i + X^B_i \ (\hbox {mod}\ N)\}$.

Output Viewing the k values as a vector in ${\mathbb {Z}}_N^k$, Alice and Bob share the characteristic vector $\mathbf {e}_i \in {\mathbb {Z}}_2^k$ with the ‘1’ in the ith position, where i is the location of $\min (X_1, \dots , X_k)$.

Cost Communication cost of this protocol is $(k-1)$ times the cost of FM2NP plus $O(k\lambda ^2)$.

Protocol Description This protocol can be obtained as a straightforward extension of (a variant of the) FM2NP. In particular, in addition to having the FM2NP output (shares of) the location of the minimum, have it also output (shares of) the value of the minimum. Notice that the value of the minimum can be obtained from the location by running the SPP, since:

$$\begin{aligned} \min (x,y) = (1-L) \cdot x + L \cdot y, \end{aligned}$$

(27)

where $L \in \{0, 1\}$ is the location of the minimum of (x, y). Since the function in (27) has a constant number of terms, the cost of employing SPP is $O(\lambda )$, which can be absorbed in the $O(\lambda ^2)$ cost of running the FM2NP.

Suppose Alice and Bob have k inputs (WLOG, assume $k = 2^m$ is a power of 2). They pair off the k inputs into k/2 sets of pairs, and run this alternate FM2NP $k/2$ times, obtaining (shares of) the location and minimum value of each pair: $(\mathbf {e}_{l_j}, z_j)$ for each $1 \le j \le k/2$, where $l_j \in \{0, 1\}$ denotes the location of the minimum of the jth pair, $\mathbf {e}_{l_j}$ denotes the characteristic vector (in ${\mathbb {Z}}_2^2$) with a ‘1’ in the $l_j$th position, and $z_j$ denotes the minimum value within the jth pair of values. This procedure is then repeated by pairing up the k/2 minimums $\{z_j\}$ into k/4 sets, and running FM2NP $k/4$ times, and so on. In the end, FM2NP will need to be run a total of $k-1$ times.

Notice that (shares of) the final minimum value is a direct output of these $k-1$ calls to FM2NP; if the minimum’s location is required (as per specification of the Output of the FMkNP above), then this location $\mathbf {e}_i \in {\mathbb {Z}}_2^k$ can be obtained as follows. First, the minimum value is compared to each input, such that for each comparison, the parties will share ‘1’ if that value matches the minimum, and share ‘0’ otherwise. (This requires k calls to a secure Check Equality protocol, that compares two numbers (each shared between Alice and Bob) and returns a ‘1’ if the two numbers are equal, and ‘0’ otherwise. Such a protocol could be implemented with $O(\lambda ^2)$ communication cost by mimicking the circuit representation for equality on two (binary) numbers; namely, invoking the To Binary Protocol to obtain (shares of) the binary representations of the two numbers, and then applying the SPP and NPP to ‘AND’ together the equality check on the binary digits.) The result of these k calls to a secure equality protocol is almost enough to yield $\mathbf {e}_i$, except that multiple inputs may equal the minimum value. To control for this case, we can utilize the Nested Product Protocol (on k terms) to arrive at the final $\mathbf {e}_i$. Namely, if $u_j$ denotes the output of the equality protocol that compares the minimum with the jth input, then the jth coordinate of $\mathbf {e}_i$ is given by:

$$\begin{aligned} \mathbf {e}_{i,j} = u_j \cdot \prod _{k < j} (1 - u_k) \end{aligned}$$
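
The two phases described above (the tournament of pairwise minimums, followed by the equality checks and the nested product) can be sketched in the clear as follows. Comparisons and equality tests are done directly on plaintext values instead of via FM2NP and a Check Equality protocol, and the function names are our own.

```python
def tournament_min(values):
    """Pairwise rounds; returns (minimum value, number of 2-input comparisons)."""
    values = list(values)
    comparisons = 0
    while len(values) > 1:
        winners = []
        for i in range(0, len(values) - 1, 2):
            x, y = values[i], values[i + 1]
            L = 0 if x < y else 1                 # location of the minimum of (x, y)
            winners.append((1 - L) * x + L * y)   # value of the minimum, as in (27)
            comparisons += 1
        if len(values) % 2:                       # an unpaired value advances unchanged
            winners.append(values[-1])
        values = winners
    return values[0], comparisons

def min_location_vector(values):
    """Characteristic vector with a single '1' at the first position of the minimum."""
    minimum, _ = tournament_min(values)
    u = [1 if v == minimum else 0 for v in values]   # k equality checks
    e, no_match_before = [], 1
    for u_j in u:
        e.append(u_j * no_match_before)   # e_j = u_j * prod_{k<j} (1 - u_k)
        no_match_before *= 1 - u_j
    return e
```

On the input (7, 3, 9, 3) the equality checks give u = (0, 1, 0, 1), and the nested product keeps only the first match, so the output is (0, 1, 0, 0).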

C.6 Description of the Change Modulus Protocol

Setup Let $N_1, N_2 \in {\mathbb {Z}}$ be two positive integers, and let $Q < \min (N_1, N_2)$ be an arbitrary non-negative integer smaller than both $N_1$ and $N_2$.

Input Alice and Bob share $Q = Q^A + Q^B \ (\hbox {mod}\ N_1)$ modulo the first value $N_1$.

Output Alice and Bob share $Q= \widehat{Q}^A +\widehat{Q}^B \ (\hbox {mod}\ N_2)$ modulo the second value $N_2$.

Cost Cost of FM2NP ($O(\lambda ^2)$, where $\lambda = \lfloor \log _2 N_1 \rfloor $).

Protocol Description There are two cases for how Q relates to $Q^A$ and $Q^B$ (in terms of ordinary arithmetic in ${\mathbb {Z}}$):

$$\begin{aligned} Q = \left\{ \begin{array}{ll} Q^A + Q^B&{}\quad \text{ if }\ Q^A + Q^B < N_1 \\ Q^A + Q^B - N_1 &{} \quad \text{ if }\ Q^A + Q^B \ge N_1 \end{array}\right. \end{aligned}$$

Alice and Bob compute their new shares of Q modulo $N_2$ as:

$$\begin{aligned} \begin{array}{rl} \text{ Alice } \text{ sets } &{}\widehat{Q}^A = (Q^A - b^A \cdot N_1) \ (\text{ mod } N_2)\\ \text{ Bob } \text{ sets } &{}\widehat{Q}^B = (Q^B - b^B \cdot N_1) \ (\text{ mod } N_2), \end{array} \end{aligned}$$

(28)

where:

$$\begin{aligned} b := b^A + b^B \ (\text{ mod } N_2) = \left\{ \begin{array}{ll} 0&{}\quad \text{ if }\ Q^A + Q^B < N_1 \\ 1 &{} \quad \text{ if }\ Q^A + Q^B \ge N_1 \end{array}\right. \end{aligned}$$

Notice that (shares of) b can be computed via the FM2NP (on inputs $N_1$ and Q), and then (28) can be computed locally by each party.
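
The share update in (28) can be checked numerically with a small simulation. The bit b is computed in the clear here and then split into shares modulo $N_2$; in the protocol the parties would only ever hold shares of b. The function name `change_modulus` and the `rng` parameter are assumptions made for this sketch.

```python
import random

def change_modulus(qa, qb, N1, N2, rng):
    """Map shares of Q modulo N1 to shares of Q modulo N2, following (28)."""
    b = 1 if qa + qb >= N1 else 0        # wrap-around indicator (in the clear here)
    ba = rng.randrange(N2)               # additive shares of b modulo N2
    bb = (b - ba) % N2
    new_a = (qa - ba * N1) % N2          # Alice's local computation
    new_b = (qb - bb * N1) % N2          # Bob's local computation
    return new_a, new_b
```

The new shares sum to $Q^A + Q^B - b \cdot N_1$ modulo $N_2$, which equals Q because $Q < N_2$.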

C.7 Description of the Addition Modulo Unknown Value Protocol

Setup $Q < N \in {\mathbb {Z}}$ are two positive integers.

Input Alice and Bob share $Q = Q^A + Q^B \ (\hbox {mod} \ N)$ and also share $X = X^A + X^B \ (\hbox {mod}\ N)$ and $Y = Y^A + Y^B \ (\hbox {mod}\ N)$ with $X, Y < Q$.

Output Alice and Bob share (modulo N) the sum of $X + Y$ (modulo Q):

$$\begin{aligned} X + Y \ (\text{ mod } Q) \ = \ Z^A + Z^B \ (\text{ mod } N) \end{aligned}$$

Cost Cost of FM2NP ($O(\lambda ^2)$) plus SPP applied to a two-term function ($O(\lambda )$), where $\lambda = \lfloor \log _2 N \rfloor $.

Protocol Description Since $X,Y <Q$, we can write (with arithmetic in ${\mathbb {Z}}$):

$$\begin{aligned} X+Y \ (\text{ mod } Q) = \left\{ \begin{array}{ll} X + Y &{}\quad \text{ if }\ X + Y < Q \\ X+Y -Q &{} \quad \text{ if }\ X+Y \ge Q \end{array}\right. \end{aligned}$$

Alice and Bob compute shares of $X+Y \ (\text{ mod } Q)$ as:

$$\begin{aligned} \begin{array}{rl} \text{ Alice } \text{ sets } &{}Z^A = (X^A + Y^A - C^A) \ (\text{ mod } N)\\ \text{ Bob } \text{ sets } &{}Z^B = (X^B + Y^B - C^B) \ (\text{ mod } N), \end{array} \end{aligned}$$

(29)

where $C = C^A+C^B\ (\text{ mod } N) := b \cdot Q$, and:

$$\begin{aligned} b := b^A + b^B \ (\text{ mod } N) = \left\{ \begin{array}{ll} 0&{}\quad \text{ if }\ X + Y < Q \\ 1 &{} \quad \text{ if }\ X + Y \ge Q \end{array}\right. \end{aligned}$$

Notice that (shares of) b can be computed via the FM2NP (on inputs $X+Y$ and Q), and (shares of) C can be obtained via the SPP applied to the degree-two function $b\cdot Q = (b^A + b^B)\cdot (Q^A+Q^B) = (b^A \cdot Q^A) + (b^A \cdot Q^B) + (b^B \cdot Q^A) + (b^B \cdot Q^B)$ (all arithmetic modulo N). From shares of C, (29) can be computed locally by each party.
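The following is a minimal sketch of (29), assuming that the comparison (FM2NP) and the product $b \cdot Q$ (SPP) are replaced by in-the-clear stand-ins, so it shows only the arithmetic and none of the security. The helper names `share` and `add_mod_unknown` are hypothetical.

```python
import random


def share(value, modulus):
    """Split value into two additive shares modulo modulus."""
    first = random.randrange(modulus)
    return first, (value - first) % modulus


def add_mod_unknown(x_sh, y_sh, q_sh, n):
    """Return shares (mod n) of (X + Y) mod Q from shares (mod n) of X, Y, Q."""
    # Stand-ins for FM2NP and SPP: values are reconstructed in the clear,
    # which a real execution must never do.
    x = sum(x_sh) % n
    y = sum(y_sh) % n
    q = sum(q_sh) % n
    b = 1 if x + y >= q else 0           # wrap-around bit
    c_a, c_b = share((b * q) % n, n)     # shares of C = b * Q
    z_a = (x_sh[0] + y_sh[0] - c_a) % n  # Alice's local step of (29)
    z_b = (x_sh[1] + y_sh[1] - c_b) % n  # Bob's local step of (29)
    return z_a, z_b
```

Because $X + Y - b \cdot Q$ always lies in $[0, Q)$ and $Q < N$, reducing the two output shares modulo N recovers $X + Y \ (\text{mod } Q)$ exactly.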

1.8 C.8 Compute Modulus Mask Protocol

Input Alice and Bob share $D = D^A + D^B \ (\hbox {mod}\ N)$.

Output Alice and Bob share $\mathbf {v} =(v_{\lambda }, v_{\lambda - 1}, \dots , v_1, v_0) \in {\mathbb {Z}}_2^{\lambda +1}$, where $\lambda = \lfloor \log _2 N \rfloor $ and the ith coordinate of $\mathbf {v}$ obeys:

$$\begin{aligned} v_i = \left\{ \begin{array}{ll} 0 \quad &{} \text{ if } \ 2^iD \ge N \\ 1 \quad &{} \text{ if } \ 2^iD < N \end{array} \right. \end{aligned}$$

Cost. Communication cost of this protocol is $\lambda $ calls to FM2NP and SPP.

Protocol Description

1. Define $O_0 = 1$. Repeat the following for $1 \le i \le \lambda $: Alice and Bob run the FM2NP on $(N-2^{i-1}D, \ N-2^iD-1)$; let $O_i \in \{0, 1\}$ denote the output. Note that this protocol will use the version of FM2NP that always outputs ‘0’ in the case of equality.
2. Let $\mathbf {O} = (O_{\lambda }, O_{\lambda -1}, \dots , O_1, O_0)$ denote the vector formed by the $\lambda $ calls to FM2NP (plus the default $O_0 = 1$ coordinate), and notice that $\mathbf {O} = (?, \dots ?, 0, 1, \dots , 1)$, where the rightmost coordinates are (at least one) 1’s, preceded by (at least one) 0, which is preceded by the leading coordinates of $\mathbf {O}$ which are unimportant. Note that the first ‘0’ (reading from right-to-left) occurs in the ith coordinate iff i is the first time $2^iD \ge N$. Alice and Bob can modify this to share $\mathbf {v}= (0, \dots , 0, 1, \dots 1)$ by running the SPP $(\lambda -1)$ times: namely, for $2 \le i \le \lambda $, compute ${v}_{i} = {v}_{i-1} \cdot {O}_{i}$ (there is no need to compute $v_0$ or $v_1$, which can be directly set to equal $O_0$ and $O_1$, respectively).
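The two steps above can be sketched in the clear as follows. This is only one plausible reading, under the assumptions that $1 \le D < N$, that both FM2NP inputs are reduced modulo N, and that FM2NP outputs 1 exactly when its second input is strictly smaller than its first (so equality gives 0). The function name `modulus_mask` is hypothetical, and the returned list is indexed as `v[i]` $= v_i$ rather than written from $v_\lambda $ down to $v_0$.

```python
def modulus_mask(d, n):
    """Return [v_0, ..., v_lambda] with v_i = 1 iff 2**i * d < n (for 1 <= d < n)."""
    lam = n.bit_length() - 1              # lambda = floor(log2 n)
    o = [1]                               # O_0 = 1 by definition
    for i in range(1, lam + 1):
        left = (n - 2 ** (i - 1) * d) % n
        right = (n - 2 ** i * d - 1) % n
        # Stand-in for FM2NP: 1 if the second input is strictly smaller,
        # 0 otherwise (including the case of equality).
        o.append(1 if right < left else 0)
    v = o[:2]                             # v_0 = O_0 and v_1 = O_1 directly
    for i in range(2, lam + 1):
        v.append(v[i - 1] * o[i])         # stand-in for one SPP call
    return v
```

For example, with the illustrative values $N = 13$ and $D = 3$, the products $3, 6, 12$ are below 13 and $24$ is not, and the function returns `[1, 1, 1, 0]`. Once some $O_i$ is 0, the running product forces every later $v_j$ to 0, which is why the leading coordinates of $\mathbf {O}$ do not matter.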

1.9 C.9 Compute $\mathbf {e}_i$ Protocol

Setup Let $E: {\mathbb {G}}_1 \rightarrow {\mathbb {G}}_2$ be a public-key homomorphic encryption scheme admitting scalar multiplication (e.g. Paillier) for which Alice has the decryption key (and Bob does not). Let $N = |{\mathbb {G}}_1|$ denote the size of the plaintext group, and let $\lambda = \lfloor \log _2 N \rfloor $ be the security parameter.

Input Alice and Bob share $Q = Q^A + Q^B \ (\hbox {mod}\ N)$. Writing the binary representation of Q as $Q=q_\lambda \dots q_1 q_0$, they also share each bit $q_i = q^A_i + q^B_i$ (mod N) for $0 \le i \le \lambda $. Bob has also run a Reordering Protocol to obtain a reordering of the integers $[0..\lambda ]$, which is denoted $\{\sigma (0), \ \sigma (1), \ \dots , \ \sigma (\lambda )\}$.

Output Alice and Bob share the unit vector $\mathbf {e} = (0, \dots , 1, \dots , 0) \in {\mathbb {Z}}_2^{1 + \lambda }$, where the unique ‘1’ appears in coordinate i with probability:

$$\begin{aligned} \begin{array}{ll} 0\quad &{}\text{ if } \ q_i = 0\\ 2^i/Q&{}\text{ if }\ q_i = 1 \end{array} \end{aligned}$$

(30)
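As a quick check of (30), take the illustrative value $Q = 13$, whose binary representation is $1101$ (this value is chosen here only as an example and does not come from the protocol description). The unique ‘1’ then lands in coordinates 3, 2, 1, 0 with probabilities:

$$\begin{aligned} \frac{2^3}{13}, \quad \frac{2^2}{13}, \quad 0, \quad \frac{2^0}{13}, \qquad \text{ with } \quad \frac{8 + 4 + 1}{13} = 1, \end{aligned}$$

so the probabilities sum to one, as they must because $Q = \sum _{i \, : \, q_i = 1} 2^i$.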

Cost. Communication cost of this protocol is $O(\lambda ^2)$: There are $2\lambda + 2$ ciphertexts (of size $O(\lambda )$), one invocation of an NPP on $O(\lambda )$ terms, and $O(\lambda )$ invocations of an SPP.

Protocol Description Let $\mathbf {e}_j = (0, \dots , 0, 1, 0, \dots , 0)$ denote the characteristic vector with a ‘1’ in the jth coordinate.

1. Alice sends Bob $\{E(q^A_0), \ E(q^A_1), \ \dots , \ E(q^A_\lambda )\}$.
2. Bob picks $1 + \lambda $ elements $\{ Z_0, \ Z_1, \ \dots , \ Z_\lambda \} \leftarrow _R {\mathbb {G}}_1$ uniformly at random and (utilizing the homomorphic properties of E) returns to Alice $\{E(q^A_{\sigma (0)} - Z_0), \ E(q^A_{\sigma (1)} - Z_1), \ \dots , \ E(q^A_{\sigma (\lambda )} - Z_{\lambda })\}$ who decrypts each term. Notice that Bob has rearranged the order in which he returns things to Alice based on his choice of $\sigma $, but Alice doesn’t know the new order because Bob has blinded each term with randomness $Z_i$. Thus, for each $0 \le i \le \lambda $, Alice and Bob now share $q_{\sigma (i)} = ((q^A_{\sigma (i)} - Z_i) + (q^B_{\sigma (i)} + Z_i)) \ (\text{ mod } N)$, with the property that neither party knows any of the values $\{q_{\sigma (i)} \}$, and that Alice knows nothing about $\sigma $.
3. Alice and Bob compute and output (shares of) the quantity:
$$\begin{aligned} \ q_{\sigma (0)} \cdot \mathbf {e}_{\sigma (0)} + \sum _{j=1}^{\lambda } \left[ q_{\sigma (j)} \cdot \left( \prod _{k=0}^{j-1} (1 - q_{\sigma (k)}) \right) \cdot \mathbf {e}_{\sigma (j)} \right] \end{aligned}$$
(31)
by utilizing a secure Nested Product Protocol (NPP) and a Scalar Product Protocol (SPP). Namely, the terms $\prod _k (1 - q_{\sigma (k)})$ are computed and shared via NPP, the terms $\{q_{\sigma (j)}\}$ were shared in Step (2), and Bob can construct the $\{\mathbf {e}_{\sigma (j)}\}$ terms locally.
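To see what (31) evaluates to, here is a minimal in-the-clear sketch, assuming the bits $q_i$ and the reordering $\sigma $ are both known to a single evaluator (which is exactly what the secure protocol avoids). The function name `unit_vector` and the list encoding `sigma[j]` $= \sigma (j)$ are hypothetical.

```python
def unit_vector(q_bits, sigma):
    """Evaluate (31) in the clear.

    q_bits[i] is the bit q_i of Q, and sigma[j] is the index sigma(j).
    The result has a single 1 at the first index in sigma-order whose bit is 1
    (or is all zeros if every bit is 0).
    """
    e = [0] * len(q_bits)
    prefix = 1                            # product of (1 - q_sigma(k)) for k < j
    for j in range(len(q_bits)):
        pos = sigma[j]
        e[pos] += q_bits[pos] * prefix    # coefficient of e_sigma(j) in (31)
        prefix *= 1 - q_bits[pos]         # extend the nested product (NPP stand-in)
    return e
```

For instance, with the illustrative bits of $Q = 13$, `unit_vector([1, 0, 1, 1], [1, 3, 0, 2])` skips coordinate 1 (its bit is 0) and places the ‘1’ in coordinate 3, because every later term is multiplied by a factor $(1 - q_{\sigma (k)}) = 0$.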

Proof of Correctness and Security Correctness follows from the fact that the output value in (31) is the same formula that appears in the (Insecure) RVP Step 3 (see (11)), and correctness of the (Insecure) RVP Step 3 protocol was proven in Sect. 4.1.

Security follows from the fact that all communication between Alice and Bob can be classified as one of:

A. Encryptions sent from Alice to Bob in Step 1.
B. The randomized ciphertexts sent from Bob to Alice in Step 2.
C. The secure subprotocols (NPP and SPP) in Step 3.

Security of communication in (A) follows from semantic security of the encryption scheme E. Security of communication in (B) follows from the homomorphic property of E and the fact that Bob chose uniform randomness to blind the returned values to Alice. Security of communication in (C) follows from the security of the subprotocols.

$\square $

1.10 C.10 Choose ${\varvec{\mu }}_1$ Protocol

Setup Let $E: {\mathbb {G}}_1 \rightarrow {\mathbb {G}}_2$ be a public-key homomorphic encryption scheme admitting scalar multiplication (e.g. Paillier) for which Alice has the decryption key (and Bob does not). Let $N = |{\mathbb {G}}_1|$ denote the size of the plaintext group, and let $\lambda = \lfloor \log _2 N \rfloor $ be the security parameter.

Input See the setup/notation from Sect. 5.3 where this subprotocol is called. Namely, Alice and Bob have run the RVP, which has returned to them shares of a random $R \in {\mathbb {Z}}_{2n \bar{C}}$. They also share $\bar{C}$ and for each $1 \le i \le n$, they share $\widetilde{C}_i$.

Output Alice and Bob share ${\varvec{\mu }}_1 = \mathbf {D}_i$, where $\mathbf {D}_i$ has been chosen with the correct probability.

Cost Communication of this protocol is a single call to FMnNP plus $n+d$ ciphertexts of size $O(\lambda )$.

Protocol Description

1. Alice creates the vector $\mathbf {Z}^A \in {\mathbb {Z}}^n_N$, defined as follows:
$$\begin{aligned} \mathbf {Z}^A&= (\bar{C}^{A} + \widetilde{C}^{A,0}_1 -R^A, \ \ 2\bar{C}^{A} + \widetilde{C}^{A,0}_1 + \widetilde{C}^{A,0}_2 -R^A, \ \dots , \\&\qquad n \bar{C}^{A} + \widetilde{C}^{A,0}_1 + \dots + \widetilde{C}^{A,0}_n -R^A). \end{aligned}$$
Notice that the ith coordinate of $\mathbf {Z}^A$ is: $i \bar{C}^{A} -R^A + \sum _{j=1}^i \widetilde{C}^{A,0}_j$.
Bob does similarly to obtain $\mathbf {Z}^B$.
2. Alice and Bob run the FMnNP on the vector $\mathbf {Z} \in {\mathbb {Z}}^n_N$, which will return the (shares of) $\mathbf {L} = (L_1, L_2, \dots , L_n) \in \{0, 1\}^n$, with the unique ‘1’ in coordinate i, where i is the first index for which $R \le \sum _{j=1}^i (\bar{C} + \widetilde{C}^0_j)$. Alice encrypts her share $\mathbf {L}^A$ and sends this to Bob.
3. Bob can now compute (an encryption of) the scalar product:
$$\begin{aligned} {\varvec{\mu }}_1&= \mathbf {L} \cdot (\mathbf {D}_1, \dots , \mathbf {D}_n) \nonumber \\&= \sum _{i=1}^n L_i \cdot \mathbf {D}_i \end{aligned}$$
(32)
More precisely, for each $1 \le i \le n$, Bob will have to compute d products to evaluate $L_i \cdot \mathbf {D}_i$, one for each dimension. After randomizing each product, he returns these values to Alice, so that they now share ${\varvec{\mu }}_1$.

Note that (32) could have been calculated by having Alice and Bob invoke the SPP with their shares of $\mathbf {L}$ and each $\mathbf {D}_i$. However, this would cost O(nd) invocations of a SPP, for an overall communication cost of $O(\lambda nd)$, which exceeds the cost of the above described protocol ($O(\lambda ^2n + \lambda (n+d))$) so long as $\lambda <n$.
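The selection logic of Steps 1 to 3 can be sketched in the clear as follows, assuming the values $\bar{C}$, $\widetilde{C}^0_i$, R and the data points are all visible to one evaluator, with a plain loop standing in for FMnNP and ordinary arithmetic standing in for the homomorphic scalar product. The function name `choose_mu1` and the sample numbers are hypothetical.

```python
def choose_mu1(points, c_bar, c_tilde, r):
    """Return (L, mu_1), where L is the 0/1 selector and mu_1 = sum_i L_i * D_i.

    points  : list of n data points D_i, each a list of d numbers
    c_bar   : the value C-bar
    c_tilde : list of the n values C-tilde_i
    r       : the sampled value R
    """
    n = len(points)
    selector = [0] * n
    running = 0
    for i in range(n):
        running += c_bar + c_tilde[i]     # sum_{j<=i} (C-bar + C-tilde_j)
        if r <= running:                  # first i whose coordinate Z_i is >= 0
            selector[i] = 1
            break
    d = len(points[0])
    # Scalar product (32), evaluated one dimension at a time.
    mu1 = [sum(selector[i] * points[i][k] for i in range(n)) for k in range(d)]
    return selector, mu1
```

With the illustrative inputs `points = [[1, 2], [3, 4], [5, 6]]`, `c_bar = 2` and `c_tilde = [1, 4, 1]`, the running sums are 3, 9 and 12, so `r = 5` selects the second point and the result is `([0, 1, 0], [3, 4])`.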

About this article

Cite this article

Bunn, P., Ostrovsky, R. Oblivious Sampling with Applications to Two-Party k-Means Clustering. J Cryptol 33, 1362–1403 (2020). https://doi.org/10.1007/s00145-020-09349-w

Received: 27 April 2009
Revised: 22 April 2019
Published: 12 May 2020
Issue Date: July 2020
DOI: https://doi.org/10.1007/s00145-020-09349-w
