Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset


Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset

Jiawei Yuan, Member, IEEE, Yifan Tian, Student Member, IEEE

Abstract - Clustering techniques have been widely adopted in many real-world data analysis applications, such as customer behavior analysis, targeted marketing, and digital forensics. With the explosion of data in today's big data era, a major trend for handling clustering over large-scale datasets is outsourcing it to public cloud platforms. This is because cloud computing offers not only reliable services with performance guarantees, but also savings on in-house IT infrastructures. However, as datasets used for clustering may contain sensitive information, e.g., patient health information, commercial data, and behavioral data, directly outsourcing them to public cloud servers inevitably raises privacy concerns. In this paper, we propose a practical privacy-preserving K-means clustering scheme that can be efficiently outsourced to cloud servers. Our scheme allows cloud servers to perform clustering directly over encrypted datasets, while achieving comparable computational complexity and accuracy to clustering over unencrypted ones. We also investigate the secure integration of MapReduce into our scheme, which makes our scheme extremely suitable for the cloud computing environment. Thorough security analysis and numerical analysis demonstrate the performance of our scheme in terms of security and efficiency. Experimental evaluation over a 5 million object dataset further validates the practical performance of our scheme.

Index Terms - Privacy-preserving, K-means Clustering, Cloud Computing

I. INTRODUCTION

CLUSTERING is a major task of exploratory data mining and statistical data analysis, which has been ubiquitously adopted in many domains, including healthcare, social networks, image analysis, and pattern recognition.
Meanwhile, the rapid growth of big data involved in today's data mining and analysis also introduces challenges for clustering in terms of volume, variety, and velocity. To efficiently manage large-scale datasets and support clustering over them, public cloud infrastructure plays the major role for both performance and economic considerations. Nevertheless, using public cloud services inevitably introduces privacy concerns. This is because not only are many data involved in data mining applications sensitive by nature, such as personal health information, localization data, and financial data, but also the public cloud is an open environment operated by external third parties [1]. For example, a promising trend for predicting an individual's disease risk is clustering over existing patients' health records [2], which contain sensitive patient information according to the Health Insurance Portability and Accountability Act (HIPAA) Policy [3]. Therefore, appropriate privacy protection mechanisms must be in place when outsourcing sensitive datasets to the public cloud for clustering. The problem of privacy-preserving K-means clustering has been investigated under the multi-party secure computation model [4]-[9], in which the owners of distributed datasets interact for clustering without disclosing their own datasets to each other. In the multi-party setting, each party has a collection of data and wishes to collaborate with others in a privacy-preserving manner to improve the clustering accuracy. Differently, the dataset in clustering outsourcing is typically owned by a single entity, who aims at minimizing the local computation by delegating the clustering task to a third-party cloud server.
In addition, existing multi-party designs always rely on powerful but expensive cryptographic primitives (e.g., secure circuit evaluation, homomorphic encryption, and oblivious transfer) to achieve collaborative secure computation among multiple parties, and are inefficient for large-scale datasets. Thus, these multi-party designs are not practical for privacy-preserving outsourcing of clustering. Another line of research that targets efficient privacy-preserving clustering uses distance-preserving data perturbation or data transformation to encrypt datasets [10], [11]. Nevertheless, utilizing data perturbation and data transformation for privacy-preserving clustering may not achieve sufficient privacy and accuracy guarantees [12], [13]. For example, adversaries who obtain a few unencrypted data records in the dataset will be able to recover the rest of the records protected by data transformation [12]. Recently, the outsourcing of K-means clustering was studied in ref [14] by utilizing homomorphic encryption and an order-preserving index. However, the homomorphic encryption utilized in [14] is not secure, as pointed out in ref [15]. Moreover, due to the cost of the relatively expensive homomorphic encryption, ref [14] is efficient only for small datasets, e.g., fewer than 50,000 data objects. Another possible candidate for achieving privacy-preserving K-means clustering is to extend existing privacy-preserving K-nearest neighbors (KNN) search schemes [16]-[18]. Unfortunately, these privacy-preserving KNN search schemes are limited by their vulnerability to linear analysis attacks [16], support for at most two-dimensional data [17], or accuracy loss [18]. In addition, KNN is a single-round search task, but K-means clustering is an iterative process that requires the update of clustering centers based on the entire dataset after each round of clustering.
Considering the efficient support of large-scale datasets, these update processes also need to be outsourced to the cloud server in a privacy-preserving manner. Besides privacy protection, there are two other major factors in the outsourcing of K-means clustering: Clustering Efficiency and Clustering Accuracy. Specifically, a practical privacy-preserving outsourcing of K-means clustering shall be easily parallelized, which is important in the cloud computing environment for performance guarantees on large-scale datasets. Meanwhile, the computational cost of the dataset owner shall be minimized, i.e., the owner is only responsible for the system

setup as well as lightweight interactions with cloud servers. Although a number of MapReduce based K-means clustering schemes have been proposed to handle large-scale datasets in parallel [19]-[21], none of them considers privacy protection for the outsourced dataset. Moreover, the privacy protection offered in an outsourced K-means design shall have only slight (ideally no) influence on the clustering accuracy. This is because accuracy is the key factor determining the quality of a clustering algorithm. To the best of our knowledge, there is no existing privacy-preserving MapReduce based K-means outsourcing design that achieves comparable efficiency and accuracy to clustering over unprotected datasets. In this work, we propose a practical privacy-preserving K-means clustering scheme for large-scale datasets, which can be efficiently outsourced to public cloud servers. Our proposed scheme simultaneously meets the privacy, efficiency, and accuracy requirements discussed above. In particular, we propose a novel encryption scheme based on the Learning with Error (LWE) hard problem [22], which achieves privacy-preserving similarity measurement of data objects directly over ciphertexts. Based on our encryption scheme, we further construct the whole K-means clustering process in a privacy-preserving manner, in which cloud servers only have access to encrypted datasets and perform all operations without any decryption. Moreover, we uniquely incorporate MapReduce [23] into our scheme with privacy protection, and thus significantly improve the clustering performance in the cloud computing environment. We provide thorough analysis of our scheme in terms of security and efficiency. We also implemented a prototype of our scheme on the Microsoft Azure cloud. Our extensive evaluation results over 5 million objects show that our privacy-preserving clustering is efficient, scalable, and accurate.
Specifically, compared with the K-means clustering over unencrypted datasets, our scheme achieves the same accuracy as well as comparable computational performance and scalability. The rest of this paper is organized as follows: In Section II, we explain our system model and threat model. Section III describes the background of K-means clustering and the MapReduce framework. We provide the detailed construction of our scheme and its security analysis in Section IV. We evaluate the performance of our scheme in Section V, which is followed by a review of related work in Section VI. Section VII concludes this work.

II. MODELS

A. System Model

In our design, we consider two major entities as shown in Fig. 1: a Dataset Owner and a Cloud Server. The owner has a collection of data objects, which will be outsourced to the cloud server for clustering after encryption. The cloud server performs the K-means clustering directly over the encrypted dataset without any decryption. During the clustering, the cloud server interacts with the owner for a small amount of encrypted intermediate inputs/outputs. The clustering is finished when the clustering results do not change any more, or a predefined number of iterations is reached.

Fig. 1. System Architecture. The owner outsources the encrypted dataset; the cloud server performs single rounds of privacy-preserving clustering; intermediate results and updated encrypted clustering centers are exchanged until the final clustering results are produced.

B. Threat Model

In this work, the cloud server is considered honest-but-curious [24], i.e., the cloud server will honestly follow the designed protocol but try to disclose as much of the content of the dataset as possible. This assumption is consistent with existing works on privacy-preserving outsourcing in cloud computing [14], [25]-[27]. Based on the information available to the cloud server, we consider the following threat models in terms of the privacy protection of data.
Ciphertext Only Model: The cloud server only has access to all encrypted data objects in the dataset, all encrypted clustering centers, and all intermediate outputs generated by the cloud server.

Known Background Model: In this stronger threat model, the cloud server has all the information available in the Ciphertext Only Model. In addition, the cloud server may have some background information about the dataset (e.g., what the topic of the dataset is), and may obtain a small number of data objects in the dataset. We also consider that the cloud server is not able to obtain the clustering centers from background information, since they are generated based on all data objects on the fly during the clustering process. We assume the owner will not be compromised by adversaries. In our scheme, we should prevent the cloud server as well as outside adversaries from learning any data object or clustering center outsourced by the owner.

III. BACKGROUND AND TECHNICAL PRELIMINARIES

A. K-means Clustering

The K-means clustering algorithm aims to allocate a set of data objects into k disjoint clusters, each of which has a center. Each data object is assigned to the cluster whose center has the shortest distance to the object. Data objects and centers can be denoted as multi-dimensional vectors, and their distances can be measured using the square of the Euclidean distance. For example, the square of the distance between two m-dimensional vectors D_1 = [d_11, d_12, ..., d_1m] and D_2 = [d_21, d_22, ..., d_2m] can be calculated as Dist(D_1, D_2) = Σ_{j=1}^m (d_1j − d_2j)². As shown in Algorithm 1, K-means clustering is an iterative process. The algorithm selects k initial cluster centers, and all data objects are allocated to the cluster whose center has the shortest distance to them. After a round of clustering, the centers of the clusters are updated. Particularly, the new center of

Algorithm 1: K-means Clustering
Input: k: number of clusters; max: a predefined number of iterations; n data objects D_i = (d_i1, ..., d_im), 1 ≤ i ≤ n
Output: k clusters
begin
    Select k initial cluster centers C_x, 1 ≤ x ≤ k
    while max > 0 do
        1. Assign each data object D_i to the cluster center with minimum distance Dist(D_i, C_x) to it.
        2. Update each C_x to the average value of the D_i assigned to cluster x.
    Output the k reallocated clusters.

a cluster is generated by averaging each element over all data object vectors in the same cluster. For example, if D_x, D_y, D_z are assigned to the same cluster, the new center is calculated as (D_x + D_y + D_z)/3. This clustering and center update process is conducted iteratively until the data objects in each cluster do not change any more, or a predefined number of iterations is reached. For more details about K-means clustering, please refer to ref [28].

B. Weighted K-means Clustering

A popular extension of the original K-means clustering is the weighted K-means clustering [29], in which every data element is associated with a real-valued weight. This is because different elements in an object can have different levels of importance. In weighted K-means clustering, a weight vector W = [w_1, w_2, ..., w_m] is created for the dataset. Instead of directly using the Euclidean distance for the clustering measurement, the distance between a data object vector D = [d_1, d_2, ..., d_m] and a clustering center C = [c_1, c_2, ..., c_m] is calculated as WDist(D, C, W) = Σ_{j=1}^m w_j (d_j − c_j)². Other operations in weighted K-means clustering remain the same as in the original K-means clustering. For more details about weighted K-means clustering, please refer to ref [29].

C. MapReduce Framework

MapReduce is a programming framework for processing large-scale datasets in a distributed manner. As shown in Fig. 2, to process a task with massive amounts of data, MapReduce divides the task into two phases: map and reduce. These two phases are expressed with map and reduce functions, which take <key,value> pairs as their input and output data format. In a cluster, the nodes responsible for the map and reduce functions are called mappers and reducers respectively. In a MapReduce task, the framework splits input datasets into data chunks, which are processed by independent mappers in parallel. Each map function processes data and generates intermediate outputs as <key,value> pairs. These intermediate outputs are forwarded to reducers after the shuffle. According to the key space of the <key,value> pairs in the intermediate outputs, each reducer is assigned a partition of the pairs. In MapReduce, intermediate <key,value> outputs with the same key are sent to the same reducer. After that, reducers sort and group all intermediate outputs in parallel to generate the final result. More details about MapReduce are introduced in ref [23].

Fig. 2. MapReduce Framework

IV. CONSTRUCTION OF PRIVACY-PRESERVING MAPREDUCE BASED K-MEANS CLUSTERING

A. Scheme Overview

Our scheme consists of three stages as shown in Fig. 3: 1) System Setup and Data Encryption; 2) Single Round MapReduce Based Privacy-preserving Clustering; 3) Privacy-preserving Clustering Center Update. In Stage 1, the owner first sets up the system by selecting parameters for K-means and MapReduce. The owner then generates encryption keys for the system, and encrypts the dataset for clustering. In Stage 2, the cloud server performs a round of clustering and allocates encrypted objects to their closest clustering centers. After that, the cloud server returns a small amount of encrypted information back to the owner as the intermediate outputs. In Stage 3, the owner updates the clustering centers based on the information from the cloud server and his/her own secret keys. These new centers are sent to the cloud server in encrypted format for the next round of clustering. Stage 2 and Stage 3 are iteratively executed until the clustering result does not change any more or the predefined number of iterations is reached.

Fig. 3. Scheme Overview. Stage 1 (owner): 1. Parameter Selection; 2. System Key Generation; 3. Data Encryption. Stage 2 (cloud server): Single Round MapReduce Based Privacy-preserving Clustering. Stage 3 (owner): 1. Decrypt Aggregated Ciphertexts; 2. Update and Re-encrypt Clustering Centers. The encrypted dataset and encrypted initial clustering centers, the aggregated ciphertexts for each cluster, and the updated encrypted clustering centers are exchanged iteratively until the clustering is finished.

We now give the detailed construction of each stage in our scheme. We summarize the important notations used in our construction in Table I, and define two mathematical operations as below.

TABLE I: NOTATION
n — Total number of data objects in the dataset
m — Total number of elements in a data object
K — Total number of clusters
D_i, D'_i — Extended data object vectors
C_k — Extended clustering centers
e_i, e_k — Random noise vectors
W — Weight vector
M, Q — Random invertible matrices
M^{-1}, Q^{-1} — Inverses of M and Q
I — Identity matrix
E(·) — Encryption
SUM_k — Aggregated ciphertexts of data objects
List_k — Clustering result list

Definition IV.1. For a ∈ R, define ⌊a⌉ to be the nearest integer to a, and ⌊a⌉_q to be the nearest integer to a with modulus q.

Definition IV.2. For a vector D (or a matrix M), define max(D) (or max(M)) to be the maximum absolute value of its elements.

B. Detailed Construction

1) Stage 1 - System Setup and Data Encryption: In our system, we consider a dataset with n data objects, which needs to be clustered into K clusters. Each data object contains m elements, and the elements of all objects are scaled to integers with the same scale factor. We denote each scaled data object as {d_i1, d_i2, ..., d_im} ∈ Z_p. Given a data object, the owner first extends it to two 2m-dimensional vectors as

D_i = [r_i·d_i1, r_i·d_i2, ..., r_i·d_im, r_i, α_i1, α_i2, ..., α_i(m−1)]
D'_i = [d_i1, d_i2, ..., d_im, r_i, α_i1, α_i2, ..., α_i(m−1)]

where r_i, α_i1, α_i2, ..., α_i(m−1) ∈ Z_p are random numbers selected by the owner for each data object, and r_i is positive. D_i will be used for the privacy-preserving clustering, and D'_i will be used for the privacy-preserving update of clustering centers. The owner also selects K initial clustering centers and extends them to 2m-dimensional vectors C_k, 1 ≤ k ≤ K, as

C_k = [2c_k1, 2c_k2, ..., 2c_km, −Σ_{j=1}^m c_kj², β_1, β_2, ..., β_(m−1)]

where β_1, β_2, ..., β_(m−1) ∈ Z_p are random numbers selected by the owner and regenerated for each round of clustering. Note that there are different ways of selecting the initial centers [30], and our design is independent of how the initial centers are selected.
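The effect of this extension can be checked numerically. The sketch below is a plaintext reconstruction of the extended-vector trick under our reading of the scheme; the exact signs of the extended entries (2·c_j and −Σ c_j²) are an assumption inferred from the distance-comparison equations of Stage 2, not a verbatim copy of the paper's notation. A single dot product of the two extended vectors yields a blinded quantity that encodes the squared Euclidean distance.

```python
# Plaintext sketch of the vector extension: one dot product encodes the
# squared Euclidean distance, blinded by r and the alpha/beta terms.
# The signs of the extended entries are a reconstruction (assumption).
import random

rng = random.Random(7)
m = 4
d = [rng.randint(0, 10) for _ in range(m)]           # scaled data object
c = [rng.randint(0, 10) for _ in range(m)]           # clustering center
r = rng.randint(1, 5)                                # positive per-object blind
alpha = [rng.randint(-10, 10) for _ in range(m - 1)] # per-object blinds
beta = [rng.randint(-10, 10) for _ in range(m - 1)]  # per-round blinds

D_ext = [r * x for x in d] + [r] + alpha                      # extended D_i
C_ext = [2 * x for x in c] + [-sum(x * x for x in c)] + beta  # extended C_k

dot = sum(x * y for x, y in zip(D_ext, C_ext))
dist = sum((x - y) ** 2 for x, y in zip(d, c))       # true Dist(C, D)
blind = sum(a * b for a, b in zip(alpha, beta))
# dot = r * (sum_j d_j^2 - Dist) + blind: order-preserving but value-hiding.
assert dot == r * (sum(x * x for x in d) - dist) + blind
```

Because the β terms are shared by all centers within one round, they cancel when two such dot products for the same object are subtracted, leaving only r·(Dist_b − Dist_a), whose sign is meaningful since r > 0.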
Key Generation: The key generation of our scheme involves the selection of two random 2m × 2m invertible matrices M and Q. We also use M^{-1} and Q^{-1} to denote the inverses of M and Q respectively, where M·M^{-1} = I, Q·Q^{-1} = I, and I is a 2m × 2m identity matrix. SK = {M, Q, M^{-1}, Q^{-1}} is set as the secret key for the system, and is only known to the owner.

Data Encryption: Given the extended data vectors D_i and D'_i of a data object, the owner encrypts them as

E(D_i) = (Γ·D_i + e_i)·M      (1)
E(D'_i) = (Γ·D'_i + e'_i)·Q      (2)

Here, Γ ∈ Z_q, q ≫ p, Γ ≫ max(e_i), and e_i ∈ Z_q^{2m} is a random integer noise vector generated for each D_i. Then, for each extended clustering center C_k, the owner encrypts it as

E(C_k) = M^{-1}·(Γ·C_kᵀ + e_kᵀ)      (3)

where C_kᵀ and e_kᵀ are the column vectors of C_k and e_k respectively, and e_k ∈ Z_q^{2m} is a random integer noise vector generated for each clustering center C_k. Considering the support of MapReduce, the ciphertexts of each data object are organized as a key-value pair <i, E(D_i)||E(D'_i)>, where the index i of the data object is used as the key and the concatenation of E(D_i) and E(D'_i) is set as the value. All n key-value pairs <i, E(D_i)||E(D'_i)>, 1 ≤ i ≤ n, the K encrypted clustering centers {E(C_k)}, 1 ≤ k ≤ K, and the public parameter Γ are outsourced to the cloud server.

2) Stage 2 - Single Round MapReduce Based Privacy-preserving Clustering: As described in Section III-A, the purpose of clustering for a data object is to find the clustering center that has the minimum Euclidean distance to it. As different data objects are independent of each other within a single round of clustering, we set the clustering of one object as the MapReduce job in our scheme. Now, the first task is to achieve Euclidean distance comparison directly over encrypted data objects and clustering centers.
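The encryption of Eq. 1 and its owner-side inversion can be illustrated with a toy round trip. This is a sketch under simplifying assumptions: M is built as a product of unit-triangular integer matrices (determinant 1) so that its inverse is also an integer matrix, the modulus-q reduction is omitted, and all parameter sizes are illustrative rather than the paper's instantiation.

```python
# Toy round trip of E(D) = (Gamma*D + e) * M and its inversion via M^-1.
# M unimodular and the omitted modulus are demo simplifications (assumptions).
import random

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def tri_inv(T):
    # Exact integer inverse of a unit-triangular matrix via the finite
    # Neumann series: (I + S)^-1 = I - S + S^2 - ..., S nilpotent.
    n = len(T)
    I = [[int(i == j) for j in range(n)] for i in range(n)]
    S = [[T[i][j] - I[i][j] for j in range(n)] for i in range(n)]
    inv, P, sign = [row[:] for row in I], I, 1
    for _ in range(n - 1):
        P, sign = matmul(P, S), -sign
        inv = [[inv[i][j] + sign * P[i][j] for j in range(n)] for i in range(n)]
    return inv

rng = random.Random(1)
n = 8                                   # 2m for m = 4
L = [[1 if i == j else (rng.randint(-2, 2) if i > j else 0) for j in range(n)] for i in range(n)]
U = [[1 if i == j else (rng.randint(-2, 2) if i < j else 0) for j in range(n)] for i in range(n)]
M = matmul(L, U)
M_inv = matmul(tri_inv(U), tri_inv(L))  # (LU)^-1 = U^-1 L^-1, all integers

Gamma = 10 ** 6                         # Gamma >> max |noise|
D_ext = [rng.randint(-50, 50) for _ in range(n)]
e = [rng.randint(-3, 3) for _ in range(n)]
ct = matmul([[Gamma * x + y for x, y in zip(D_ext, e)]], M)[0]   # E(D)

# Owner side: multiply by M^-1, then round away the sub-Gamma noise.
raw = matmul([ct], M_inv)[0]
recovered = [(a + Gamma // 2) // Gamma for a in raw]
assert recovered == D_ext
```

The rounding step is exactly the ⌊·⌉ operation of Definition IV.1: since the noise is much smaller than Γ, dividing by Γ and taking the nearest integer strips it off.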
Given an encrypted object E(D_i) and any two encrypted clustering centers E(C_a), E(C_b), the cloud server first computes

Comp_ia = ⌊ E(D_i)·E(C_a) / Γ² ⌉_q
        = ⌊ D_i·C_aᵀ + (e_i·C_aᵀ + D_i·e_aᵀ + e_i·e_aᵀ/Γ) / Γ ⌉_q
        = 2·r_i·Σ_{j=1}^m d_ij·c_aj − r_i·Σ_{j=1}^m c_aj² + Σ_{j=1}^{m−1} α_ij·β_j
        = r_i·(Σ_{j=1}^m d_ij² − Dist{C_a, D_i}) + Σ_{j=1}^{m−1} α_ij·β_j      (4)

and similarly

Comp_ib = r_i·(Σ_{j=1}^m d_ij² − Dist{C_b, D_i}) + Σ_{j=1}^{m−1} α_ij·β_j

The correctness of the above equations is guaranteed by the distributive property of matrix multiplication and the fact that Γ ≫ p, Γ ≫ max(e_i), and Γ ≫ max(e_k), so the noise terms vanish under the rounding operation. Based on Comp_ia and Comp_ib, the cloud can easily output the difference between Dist{C_a, D_i} and Dist{C_b, D_i} as

Comp_ia − Comp_ib = r_i·(Dist{C_b, D_i} − Dist{C_a, D_i})      (5)
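The whole comparison pipeline can be exercised end to end with toy parameters. The sketch below is an illustration, not the paper's exact instantiation: the invertible matrix is unimodular so that integer arithmetic stays exact, the modulus-q reduction is omitted, and the signs in the extended vectors are our reconstruction. It encrypts one object and two centers, computes both Comp values per Eq. 4, and checks that they order the centers by true distance as Eq. 5 promises.

```python
# End-to-end toy check: larger Comp corresponds to smaller true distance.
# Unimodular M, omitted modulus, and toy sizes are demo assumptions.
import random

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def tri_inv(T):
    # Exact inverse of a unit-triangular matrix: (I + S)^-1 = I - S + S^2 - ...
    n = len(T)
    I = [[int(i == j) for j in range(n)] for i in range(n)]
    S = [[T[i][j] - I[i][j] for j in range(n)] for i in range(n)]
    inv, P, sign = [row[:] for row in I], I, 1
    for _ in range(n - 1):
        P, sign = matmul(P, S), -sign
        inv = [[inv[i][j] + sign * P[i][j] for j in range(n)] for i in range(n)]
    return inv

rng = random.Random(3)
m, Gamma = 4, 10 ** 6
n2 = 2 * m
L = [[1 if i == j else (rng.randint(-2, 2) if i > j else 0) for j in range(n2)] for i in range(n2)]
U = [[1 if i == j else (rng.randint(-2, 2) if i < j else 0) for j in range(n2)] for i in range(n2)]
M = matmul(L, U)
M_inv = matmul(tri_inv(U), tri_inv(L))

def enc_data(v):      # E(D) = (Gamma*D + e) * M, cf. Eq. 1
    e = [rng.randint(-3, 3) for _ in v]
    return matmul([[Gamma * x + y for x, y in zip(v, e)]], M)[0]

def enc_center(v):    # E(C) = M^-1 * (Gamma*C^T + e^T), cf. Eq. 3
    e = [rng.randint(-3, 3) for _ in v]
    return [row[0] for row in matmul(M_inv, [[Gamma * x + y] for x, y in zip(v, e)])]

def comp(ct_d, ct_c):  # Comp = round(E(D).E(C) / Gamma^2), cf. Eq. 4
    s = sum(x * y for x, y in zip(ct_d, ct_c))
    return (s + Gamma * Gamma // 2) // (Gamma * Gamma)

d = [rng.randint(0, 10) for _ in range(m)]
r = rng.randint(1, 5)
alpha = [rng.randint(-10, 10) for _ in range(m - 1)]
beta = [rng.randint(-10, 10) for _ in range(m - 1)]
c_a = [rng.randint(0, 10) for _ in range(m)]
c_b = [rng.randint(0, 10) for _ in range(m)]

ct_d = enc_data([r * x for x in d] + [r] + alpha)
ct_a = enc_center([2 * x for x in c_a] + [-sum(x * x for x in c_a)] + beta)
ct_b = enc_center([2 * x for x in c_b] + [-sum(x * x for x in c_b)] + beta)
comp_a, comp_b = comp(ct_d, ct_a), comp(ct_d, ct_b)

dist = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v))
print((comp_a > comp_b) == (dist(d, c_a) < dist(d, c_b)))  # True
```

Note that the cloud only ever sees ct_d, ct_a, ct_b, and the Comp values, which are blinded by r and the α·β terms; the true distances never appear.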

As r_i is positive, it is clear that the output of the encrypted distance comparison in Eq. 5 is consistent with the comparison using the exact distances Dist{C_a, D_i} and Dist{C_b, D_i}. If the sign of Eq. 5 is positive, clustering center C_a has the smaller distance to the data object; otherwise, clustering center C_b has the smaller distance.

To further process privacy-preserving K-means clustering with MapReduce, the cloud server splits and distributes the n encrypted data pairs <i, E(D_i)||E(D'_i)>, 1 ≤ i ≤ n, to all mappers. The cloud server also sends all K encrypted clustering centers {E(C_k)}, 1 ≤ k ≤ K, to each mapper. Afterwards, each mapper initializes K 2m-dimensional vectors {SUM_k}, 1 ≤ k ≤ K, in which all elements are 0, i.e., [0, 0, ..., 0]. These vectors are used to aggregate the encrypted data objects allocated to the same center, and will be utilized for the update of clustering centers in Stage 3. The map function in our privacy-preserving K-means clustering is shown in Fig. 4. Taking an encrypted data key-value pair <i, E(D_i)||E(D'_i)> and all encrypted clustering centers {E(C_k)}, 1 ≤ k ≤ K, as inputs, a mapper iteratively invokes our privacy-preserving distance comparison described above to figure out the closest center for the data object. The intermediate outputs of our map function include two parts: 1) a key-value pair <k, i> with the index k of the closest center as the key and the index i of the data object as the value; 2) the updated SUM_k for the k-th center. Once a mapper finishes all jobs assigned to it, it organizes its final SUM_k values as key-value pairs <k, SUM_k>, 1 ≤ k ≤ K. Finally, all outputs are sent to the reducers.

Map Process of Privacy-preserving K-means
Input: K encrypted centers {E(C_k)}, 1 ≤ k ≤ K; an encrypted data pair <i, E(D_i)||E(D'_i)>; aggregation of encrypted vectors {SUM_k}, 1 ≤ k ≤ K
Output: key-value pair <index, i> and updated SUM_index, where index is the index of the clustering center assigned to data object D_i
1. index = 1;
2. minCandidate = Comp_i1; // Computed as in Eq. 4.
3. For (k = 2; k < K+1; k++) {
       If (Comp_ik − minCandidate > 0) { // Compared as in Eq. 5.
           minCandidate = Comp_ik; index = k;
       }
   }
4. Output an <index, i> pair, and update SUM_index = SUM_index + E(D'_i);
Fig. 4. Map Process of Privacy-preserving K-means

The reduce function is presented in Fig. 5. On receiving the outputs from mappers, reducers first add the indexes of the data objects allocated to the same cluster to the corresponding result list List_k, 1 ≤ k ≤ K. Reducers also aggregate the partially aggregated SUM_k values from each mapper into {SUM_k}, 1 ≤ k ≤ K, whose final values are SUM_k = Σ_{i∈List_k} E(D'_i).

Reduce Process of Privacy-preserving K-means
Input: <k, i> pairs for all data objects; partially aggregated <k, SUM'_k>, 1 ≤ k ≤ K, from each mapper
Output: K classified clusters <k, List_k> and final aggregated <k, SUM_k>, 1 ≤ k ≤ K, for all n data objects
1. While (<k, i> pairs.hasNext()) { Add i to List_k; }
2. While (<k, SUM'_k> pairs.hasNext()) { SUM_k = SUM_k + SUM'_k; }
3. Output <k, List_k> and SUM_k, 1 ≤ k ≤ K;
Fig. 5. Reduce Process of Privacy-preserving K-means

Based on the output of the reducers, the cloud server checks whether the predefined number of clustering iterations is reached, or whether all result lists List_k, 1 ≤ k ≤ K, are the same as in the previous round of clustering. If so, the clustering is finished and the cloud server sends List_k, 1 ≤ k ≤ K, back to the dataset owner as the clustering result; otherwise, the cloud server sends {SUM_k}, 1 ≤ k ≤ K, back to the owner for the update of the clustering centers. To this end, a single round of privacy-preserving MapReduce based K-means clustering is finished.

3) Stage 3 - Privacy-preserving Clustering Center Update: After a single round of clustering, the clustering centers need to be updated as described in Section III-A. Particularly, the m elements of a new clustering center are calculated as the mean values of the corresponding elements over the data objects currently allocated to the cluster, i.e.,
c_kj = (1/|List_k|)·Σ_{i∈List_k} d_ij, where 1 ≤ j ≤ m and |List_k| is the total number of data objects in the k-th cluster. To efficiently achieve this in a privacy-preserving manner, our scheme first utilizes the MapReduce design in Stage 2 to generate the aggregated ciphertexts of the data objects allocated to the same cluster. Note that only K aggregated ciphertexts {SUM_k}, 1 ≤ k ≤ K, need to be retrieved by the owner, where K is the total number of clusters and each ciphertext is a 2m-dimensional vector. Thus, the communication overhead for the interaction after each round of clustering is lightweight and independent of the size of the dataset. With {SUM_k}, 1 ≤ k ≤ K, the owner decrypts them using the secret key Q^{-1} and the public parameter Γ as

C̃_k = ⌊ SUM_k·Q^{-1} / Γ ⌉_q = Σ_{i∈List_k} D'_i      (6)

As shown in Eq. 6, the decryption of an aggregated ciphertext outputs the aggregation of the corresponding data object vectors, according to the associative and distributive properties of matrix multiplication. Given C̃_k for the k-th cluster, the owner generates the new center C_k as C̃_k/|List_k|. During the update, the owner only keeps the first m elements of each C̃_k, since all the remaining elements are extended values or random numbers as described in the data encryption process of Stage 1. After the K new clustering centers C_k are generated, the owner extends them to 2m-dimensional vectors and encrypts them using the encryption process presented in Stage 1.
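A plaintext mock of the overall round structure may help tie Stages 2 and 3 together: the map side plays the role of Fig. 4 (assign each object, accumulate per-cluster sums), the reduce side plays Fig. 5 (merge lists and partial sums), and the update step takes the element-wise means as above. Encryption is elided, the MapReduce runtime is simulated with plain Python dicts, and all names and data values are illustrative.

```python
# Plaintext mock of one clustering round in the scheme's MapReduce shape.
from collections import defaultdict

def map_chunk(chunk, centers):
    # chunk: list of (i, D_i) pairs handled by one mapper (Fig. 4's role).
    pairs, sums = [], defaultdict(lambda: [0] * len(centers[0]))
    for i, obj in chunk:
        dists = [sum((x - y) ** 2 for x, y in zip(obj, c)) for c in centers]
        k = min(range(len(centers)), key=dists.__getitem__)
        pairs.append((k, i))                      # intermediate <k, i> output
        sums[k] = [s + x for s, x in zip(sums[k], obj)]
    return pairs, sums

def reduce_all(mapper_outputs):
    # Merge per-mapper lists and partial sums (Fig. 5's role).
    lists, sums = defaultdict(list), {}
    for pairs, partial in mapper_outputs:
        for k, i in pairs:
            lists[k].append(i)                    # List_k
        for k, v in partial.items():              # SUM_k = sum of partial sums
            sums[k] = [a + b for a, b in zip(sums.get(k, [0] * len(v)), v)]
    return lists, sums

def update_centers(lists, sums, old_centers):
    # Stage 3: new center = element-wise mean over the cluster's objects.
    new = [list(c) for c in old_centers]
    for k, members in lists.items():
        new[k] = [s / len(members) for s in sums[k]]
    return new

data = [(0, [1, 1]), (1, [3, 3]), (2, [8, 9]), (3, [10, 9])]
centers = [[0, 0], [10, 10]]
chunks = [data[:2], data[2:]]                     # two mappers
lists, sums = reduce_all([map_chunk(ch, centers) for ch in chunks])
centers = update_centers(lists, sums, centers)
print(dict(lists), centers)  # {0: [0, 1], 1: [2, 3]} [[2.0, 2.0], [9.0, 9.0]]
```

In the actual scheme the sums accumulated by the mappers are the ciphertexts E(D'_i), and the division by |List_k| happens only on the owner's side after the decryption of Eq. 6.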

All new encrypted centers are sent to the cloud server for the next round of clustering as described in Stage 2. Stage 2 and Stage 3 are iteratively executed until the clustering is finished. To this end, all operations required in a K-means clustering are supported in a privacy-preserving manner in our construction.

C. Extension to Weighted K-means Clustering

As presented in Section III-B, weighted K-means clustering is similar to the original K-means clustering, with the only difference being the distance computation. In weighted K-means clustering, the weighted distance is calculated as WDist(D, C, W) = Σ_{j=1}^m w_j·(d_j − c_j)², where W = [w_1, w_2, ..., w_m] is the weight vector for the dataset. Thus, to support weighted K-means clustering in a privacy-preserving manner, our design should enable privacy-preserving weighted distance comparison. In particular, we only need to make the following change to Stage 1 of our design introduced in Section IV-B: the weight values are involved in the K extended clustering center vectors as

C_k = [2w_1·c_k1, ..., 2w_m·c_km, −Σ_{j=1}^m w_j·c_kj², β_1, ..., β_(m−1)]

These extended center vectors are encrypted in the same way as in our design for the original K-means. We now show that this change makes our design support privacy-preserving weighted distance comparison, and further leads to privacy-preserving MapReduce based weighted K-means clustering. Specifically, given the ciphertexts of a data object E(D_i), we compute Comp_ia and Comp_ib for clustering centers C_a and C_b as
Specifically, given ciphertexts of a data object E( D i ), we compute Comp ia and Comp ib for clustering centers C a and C b as Comp ia = E(D i ) E(C a ) = Γ q r i d ij w j c aj r i Comp ib = E(D i ) E(C b ) = Γ q r i d ij w j c bj r i (7) m 1 w j c aj + α ij β j m 1 w j c bj + α ij β j After that, the weighted distance comparison can be conducted as Comp ia Comp ib = r i (w j c bj w j c aj + w j (c aj c bj ) d ij ) = r i w j (d ij + c bj d ij c bj (d ij + c aj d ij c aj )) = r i (W Dist( D i, C b, W ) W Dist( D i, C a, W )) In addition, since our extension does not make any change to the processing and encryption of the dataset, the later clustering center update is the same as that in our design for original K-means clustering. To this end, the privacypreserving weighted K-means clustering based on MapReduce can be supported. D. Security Analysis In this section, we show the security of our design under the Ciphertext Only Model and Known Background Model as described in Section II-B. We first prove the security of encryption scheme for all data objects and clustering centers based on the hardness assumption of the Learning With Error (LWE) problem [], which guarantees polynomial-time adversaries are not able to recover the owner s data directly from their ciphertexts. Definition IV.3. Learning with Error (LWE) Problem Given polynomially many samples of ( a i Z m q, b i Z q ) with b i = D a T i + γ i where the error term γ i Z q is drawn from some probability distribution, it is computational infeasible to recover the vector D with non-negligible probability. Theorem IV.4. If the LWE problem is hard, then it is computable infeasible for a polynomial-time adversary to recover D i from its ciphertexts E( D i ) and E( D i ), Ck from its ciphertext E( C k ) in our proposed scheme. Proof. 
In our encryption, each extended D_i is encrypted as

E(D_i) = (Γ·D_i + e_i)·M      (8)

Since D_i and D'_i are encrypted in the same manner, we use D_i in our proof for simplicity of exposition. In E(D_i), as D_i and e_i are 2m-dimensional vectors, their multiplication with the 2m × 2m matrix M can be considered as 2m dot products of 2m-dimensional vectors as follows:

E(D_i)(1) = Γ·D_i·M(1) + e_i·M(1)
...
E(D_i)(2m) = Γ·D_i·M(2m) + e_i·M(2m)

where, for 1 ≤ j ≤ 2m, E(D_i)(j) is the j-th element of E(D_i) and M(j) is the j-th column of M. By denoting Γ·M(j) as M'(j) and e_i·M(j) as e'_ij, we have 2m samples (M'(j), E(D_i)(j)) with

E(D_i)(j) = D_i·M'(j) + e'_ij, 1 ≤ j ≤ 2m

Therefore, recovering D_i from the ciphertext E(D_i) (respectively, D'_i from E(D'_i)) becomes the LWE problem presented in Definition IV.3, which is considered hard and computationally infeasible. It is notable that, since M in our scheme is the secret key and is not available to the adversary, M'(j) is actually also not available to the adversary. This fact makes recovering D_i in our design even more difficult than the LWE problem. With regard to any C_k and its ciphertext E(C_k), we can similarly convert its encryption into 2m dot products as

E(C_k)(j) = M'^{-1}(j)·C_kᵀ + e'_kj, 1 ≤ j ≤ 2m

where M'^{-1}(j) = Γ·M^{-1}(j), M^{-1}(j) is the j-th row of M^{-1}, and e'_kj = M^{-1}(j)·e_kᵀ. Thus, recovering C_k from E(C_k) also becomes the LWE problem, given the 2m samples (M'^{-1}(j), E(C_k)(j)), 1 ≤ j ≤ 2m. In addition, M'^{-1}(j) is also not available to the adversary, since M^{-1} is part of the secret key and only known to the dataset owner. To this end, if the LWE problem is hard, recovering data objects and clustering centers from their corresponding ciphertexts is computationally infeasible for a polynomial-time adversary. Theorem IV.4 is proved.

As the cloud server only has access to ciphertexts in the Ciphertext Only Model, our proposed scheme is secure in this threat model according to our proof of Theorem IV.4. We now further analyze the security of our scheme in the Known Background Model.

Known Background Model: In this threat model, besides the ciphertexts available in the Ciphertext Only Model, the cloud server also has access to a small set of data objects obtained from background information and analysis. This differs from the KNN problem, in which all data objects as well as query objects are independent of each other and the cloud server may obtain some query objects from background information without knowing the other data objects. Differently, clustering centers in the K-means clustering setting are generated from all data objects in the dataset on the fly, and are updated after each round of clustering. Thus, we consider that the cloud server cannot obtain clustering centers from background information and analysis. As the security of our encryption is guaranteed under the LWE problem, the cloud server is not able to recover data objects or clustering centers directly from ciphertexts. We now focus on the linear analysis attack introduced by Yao et al. [17].
Instead of recovering data objects and clustering centers directly from their ciphertexts, this kind of attack attempts to recover the data from the Euclidean distance comparison results as shown in Eq.5. Specifically, given the comparison result of one data object and any two clustering centers, the cloud server can construct an equation

Rst_iab = r_i (Dist{C_b, D_i} − Dist{C_a, D_i})

In this equation, there are 3m + 1 unknowns from C_a, C_b, D_i, and r_i. As the cloud has access to a small set of data objects, it can reduce the number of unknowns in an equation Rst_iab to 2m + 1 (from C_a, C_b, and r_i) if D_i is in its known set of data objects. The original idea of the linear analysis attack is: if the cloud can obtain more than 2m data objects, it can construct 2m Rst_iab equations for the 2m unknowns in C_a and C_b, and solve them to recover C_a and C_b. However, such an attack cannot work in our design, because we embed a different random number r_i for each data object D_i. With this design, each additional equation Rst_iab constructed from a newly obtained D_i also introduces a new unknown r_i, and thus brings no contribution toward recovering C_a and C_b. Therefore, when the cloud obtains 2m data objects, it can only construct 2m equations for solving 4m unknowns from C_a, C_b, and the 2m random numbers r_i, which are unsolvable using linear analysis. To this end, our scheme prevents the cloud server from learning data objects as well as clustering centers in the Known Background Model.

V. EVALUATION

A. Numerical Evaluation

In this section, we numerically analyze our proposed scheme in terms of computational cost, communication overhead, and storage overhead. We also compare the cost of our scheme with the original K-means algorithm, and summarize the results in Table II. For expression simplicity, we use DOT_m to denote one m-dimensional dot product operation in the rest of this paper.
In particular, given two m-dimensional vectors A = [a_1, a_2, ..., a_m] and B = [b_1, b_2, ..., b_m], a DOT_m operation for them is A · B^T = Σ_{j=1}^{m} a_j b_j. We ignore single addition and single modular operations in our evaluation, since their costs are negligible compared to the DOT_m operation.

1) Computational Cost: In our scheme, the dataset owner is responsible for key generation, dataset encryption, and clustering center updates. The key generation process only involves the one-time selection of two random 2m × 2m invertible matrices. To encrypt a data object with m elements, the owner needs to perform 4m DOT_m operations as shown in Eq.1 and Eq.2. Similarly, the encryption cost for a clustering center is 2m DOT_m operations as shown in Eq.3. Therefore, given an n-object dataset and K clusters, the owner needs (4mn + 2mK) DOT_m operations for the one-time encryption. For each round of clustering center update, the owner first needs 2mK DOT_m operations to decrypt the intermediate outputs from the cloud server as shown in Eq.6. Then, another 2mK DOT_m operations are needed to re-encrypt the K updated clustering centers. Therefore, the total computational cost on the owner for a round of clustering is 4mK DOT_m operations. With regard to the cloud server, in each round of clustering it needs K DOT_m operations as shown in Eq.7 to allocate a data object to the closest center. Thus, to process n objects into K clusters in a single round of clustering, the computational cost on the cloud server is nK DOT_m operations. In the original K-means clustering as presented in Section III-A, the cloud server needs to compute K squares of Euclidean distance to allocate a data object in each round of clustering. Since the square of the Euclidean distance between two m-dimensional vectors contains the same number of addition and multiplication operations as a DOT_m operation, we represent each square of Euclidean distance as a DOT_m operation.
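The reduction of squared-Euclidean-distance comparison to dot products, which underlies the extended-vector constructions discussed in this paper, can be sketched as follows. This is a generic numpy illustration, not the paper's exact Eq.5; the extension layout is a common textbook choice:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
D = rng.random(m)                    # a data object
Ca, Cb = rng.random(m), rng.random(m)  # two clustering centers

# ||C - D||^2 = ||C||^2 - 2 C.D + ||D||^2. The ||D||^2 term is shared by both
# distances, so distance *comparison* reduces to dot products with extended
# vectors: append ||C||^2 to each center and a constant 1 to the object.
Ca_ext = np.append(-2 * Ca, Ca @ Ca)
Cb_ext = np.append(-2 * Cb, Cb @ Cb)
D_ext = np.append(D, 1.0)

# equals Dist(Cb, D) - Dist(Ca, D): the ||D||^2 terms cancel in the difference
cmp_via_dots = (Cb_ext - Ca_ext) @ D_ext
cmp_direct = np.sum((Cb - D) ** 2) - np.sum((Ca - D) ** 2)
assert np.isclose(cmp_via_dots, cmp_direct)
```

Because only the sign of the difference matters for allocating an object, a scheme can scale it by a positive per-object random number without changing the comparison result.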
To allocate n objects to K clusters in a round of clustering, the original K-means clustering algorithm thus requires nK DOT_m operations on the cloud server. Therefore, our scheme has the same computational complexity on the cloud server for each round of clustering.

2) Communication Overhead: The communication overhead in our scheme mainly comes from the interaction after each round of clustering. Specifically, if the clustering is not finished, the cloud server needs to send back K aggregated ciphertexts {SUM_k}, 1 ≤ k ≤ K, which are 2m-dimensional vectors. After that, the owner needs to return K ciphertexts {E(C_k)}, 1 ≤ k ≤ K, for the updated clustering centers, which are

TABLE II
NUMERICAL ANALYSIS FOR A SINGLE ROUND OF CLUSTERING

                                          Our Scheme           Original K-means
Computational Complexity (Cloud Server)   nK DOT_m             nK DOT_m
Computational Complexity (Owner)          4mK DOT_m            N/A
Communication Overhead                    4mK vector elements  N/A

*n: the number of data objects in the dataset; m: the number of elements in a data object; K: the number of clustering centers; typically, n >> m, K.

also 2m-dimensional vectors. Thus, the total communication cost in each round of clustering is 4mK vector elements, each of which is 8 Bytes in our implementation. It is notable that the interaction cost for each round of clustering is independent of the size of the dataset, which makes our scheme scalable for large-scale datasets. Although the communication overhead is still linear in the number of clustering centers, this is typically a small number in a practical clustering.

3) Storage Overhead: The storage overhead is introduced by the encryption of the dataset and clustering centers. In our scheme, each data object and clustering center is denoted as an m-dimensional vector, and will be encrypted as two 2m-dimensional vectors. Thus, the total storage cost in our scheme is four times that of the unprotected clustering.

4) Comparison with ref [16]: In this section, we also compare our proposed encryption scheme with that in ref [16], and summarize the results in Table III. As ref [16] is not designed for K-means, our comparison focuses on the major operation in ref [16], i.e., privacy-preserving Euclidean distance comparison. Specifically, given a data object vector with m elements, we assume both our scheme and ref [16] extend it to 2m elements by adding random elements (artificial elements in [16], respectively). To encrypt the extended vector, say D, ref [16] first splits it into two vectors D_A and D_B, and then encrypts them as two 2m-dimensional vectors E(D_A) and E(D_B) using matrix multiplication.
Differently, our scheme directly encrypts D and outputs only one 2m-dimensional vector E(D) as the ciphertext. As all ciphertexts will be outsourced to the cloud server for further processing, it is clear that the storage overhead introduced in ref [16] is twice that of our scheme. With regard to the Euclidean distance comparison process on the cloud, i.e., finding the vector among n encrypted vectors that has the smallest Euclidean distance to an encrypted request vector, ref [16] requires 2n DOT_m operations, while our proposed scheme only requires n DOT_m operations. This is because ref [16] needs to process two ciphertexts for each vector in the dataset, while only one is necessary in our scheme. Similarly, to encrypt a request vector, ref [16] needs two DOT_m operations on the dataset owner, but our scheme only requires one DOT_m operation. Therefore, compared with ref [16], our scheme saves about 50% of the computational cost on the cloud server and the data owner for privacy-preserving Euclidean distance comparison.

TABLE III
NUMERICAL ANALYSIS OF PRIVACY-PRESERVING EUCLIDEAN DISTANCE COMPARISON

                                   Our Scheme           Ref [16]
Computational Cost (Cloud Server)  n DOT_m              2n DOT_m
Computational Cost (Owner)         1 DOT_m              2 DOT_m
Storage Overhead                   2nm vector elements  4nm vector elements

B. Experimental Evaluation

1) Experiment Configuration: To evaluate the performance of our privacy-preserving MapReduce based K-means clustering scheme in terms of efficiency, scalability, and accuracy, we implemented a prototype on the Microsoft Azure cloud [31] using Java 1.7. We deployed a cluster of 6 to 10 nodes with Apache Hadoop 2.6.3 [32] installed. Two nodes are used as head nodes and the other 4 to 8 are used as worker nodes. Each node runs Ubuntu Linux 12.04 with four 2.40GHz CPU cores and 14GB memory. The local machine for the dataset owner is a desktop running OS X with eight 3.3GHz CPU cores and 8GB memory. To support matrix related operations in our scheme, the jblas library 1.2.4 [33] is adopted in the implementation.
The dataset used in our evaluation consists of 5 million simulated data objects. Each object has 10 elements, and can be represented as a 10-dimensional vector. These objects need to be allocated into 10 clusters. To demonstrate that our scheme introduces reasonable computation and communication overhead for its privacy guarantee, we also implemented a non-privacy-preserving MapReduce based K-means under the same configuration. All experimental results represent the mean of 10 trials.

Fig. 6. Dataset Encryption Cost

2) System Setup: As discussed in Section V-A1, the system setup cost mainly comes from the dataset encryption by the dataset owner. As shown in Fig.6, the dataset encryption cost increases

linearly from 1.6s to 6.34s when we change the size of the dataset from 1 million objects to 5 million objects. Note that this is a one-time cost in our scheme, and does not affect the later clustering performance.

3) Efficiency: In our evaluation, we focus on the efficiency of a single round of clustering, because different rounds of clustering have the same computational cost on the owner and the cloud server as shown in Section V-A1. In addition, the number of clustering rounds is mainly determined by the dataset itself and the selection of initial clustering centers, and is independent of the design of our scheme.

Fig. 7. Computational Cost for a Single Round of Clustering

Using 4 worker nodes, Fig.7 shows that the cloud server spends 5.05s to 10.2s to perform a single round of clustering over datasets of 1 million to 5 million objects. Compared with the non-privacy-preserving version, our scheme only brings within 35% more computational cost on the cloud for a single round of clustering. This is consistent with our analysis in Section V-A1, since our scheme achieves the same computational complexity on the cloud server as a non-privacy-preserving design. After each round of clustering, the dataset owner needs to update 10 clustering centers, which only costs 65ms. The total communication overhead after each round of clustering is 3.2KB¹, in which 1.6KB are the aggregated ciphertexts returned by the cloud server and the other 1.6KB are the updated clustering centers uploaded by the owner. It is notable that our communication overhead is independent of the size of the dataset, as shown in Fig. 8. This decent feature also promotes the scalability of our scheme for large-scale datasets. Using a 100Mb bandwidth Internet connection in our experiment, the communication after each round of clustering takes 1.34s.
Therefore, the total cost for a single round of clustering ranges from 6.39s to 11.56s when the size of the dataset varies from 1 million to 5 million objects, as shown in Fig. 7.

4) Scalability: We evaluate the scalability of our scheme with respect to scaleup [34]. Specifically, scaleup is the ability to use m-times larger resources to perform an m-times larger job in the same running time as the original job. Thus, given an original job time T, the scaleup rate is defined as the percentage of the m-times larger job that is finished within time T. In our evaluation, the original job is set as the clustering over 1 million objects using two worker nodes. We then increase the number of worker nodes to 4, 6, and 8, and the number of data objects to 2 million, 3 million, and 4 million, respectively. As demonstrated in Fig.9, the 2-times, 3-times, and 4-times scaleups in our scheme have scaleup rates of 0.9, 0.84, and 0.73, respectively, which is comparable with the scalability of the non-privacy-preserving MapReduce based K-means clustering [19].

¹Each element in a data object is formatted as a Long in Java, which is 8 Bytes.

Fig. 8. Communication Overhead for a Single Round of Clustering

Fig. 9. Scaleup Evaluation

5) Accuracy: Compared with the original K-means clustering algorithm, our scheme does not introduce any accuracy loss if all initial clustering centers are selected in the same way. In particular, the allocation of a data object to the closest center is determined by the Euclidean distance between the object and the center. As discussed in Stage 2 of our scheme, the Euclidean distance comparison result over encrypted data in our scheme is exactly the same as that over unprotected data. Moreover, the update of clustering centers is also the same as in the original K-means clustering. Therefore, our scheme achieves the same accuracy as the original K-means clustering. As a proof of concept, we perform a 100-round clustering over one million data objects. The clustering results of our scheme and the original K-means algorithm are compared by the number of data objects in each cluster after the same number of clustering rounds. As shown in Table IV, our scheme has the same clustering results as the original K-means after 100 rounds of clustering.

TABLE IV
ACCURACY COMPARISON

VI. RELATED WORK

A. Privacy-preserving Clustering

In recent years, a number of schemes have been proposed to outsource clustering tasks in a privacy-preserving manner. In refs [10], [11], distance-preserving data perturbation or data transformation techniques are adopted to protect the privacy of the dataset, while keeping the distance comparison property for clustering purposes. These perturbation based techniques are very efficient and even achieve the same computational cost as the original clustering algorithm. This is because data perturbation based encryption makes the ciphertext the same size as the original data, and uses the same clustering operations as the original clustering algorithm. However, as shown in refs [12], [13], these data perturbation based solutions do not provide a sufficient privacy guarantee. Specifically, once adversaries obtain a small set of unencrypted data objects in the dataset from background analysis, they are able to recover the remaining objects [12]. To provide a strong privacy guarantee, novel cryptographic primitives have been adopted in privacy-preserving clustering outsourcing. In ref [14], a privacy-preserving outsourcing design for K-means clustering is proposed by utilizing homomorphic encryption and order-preserving indexes. Nevertheless, as shown in ref [15], the homomorphic encryption adopted in ref [14] is not secure. Moreover, ref [14] is efficient only for small datasets, e.g., less than 50,000 data objects.
As a comparison, ref [14] requires seconds for a single round of clustering over only 30,000 data objects, while our proposed scheme can finish a single round of clustering over 5 million data objects within 15 seconds, as evaluated in Section V-B3. Another promising candidate for privacy-preserving outsourcing of K-means clustering is the secure outsourcing of general linear algebra computations [35], since all operations required in K-means clustering can be converted to linear algebra computations. However, general secure computation outsourcing mainly focuses on one-round computation, while K-means clustering is an iterative process and needs the ciphertexts to be updated for each round. The problem of privacy-preserving clustering has also been studied in the distributed setting [4]-[9]. These schemes mainly rely on secure multi-party computation techniques, such as secure circuit evaluation, homomorphic encryption, and oblivious transfer. Nevertheless, privacy-preserving distributed clustering has a different purpose from privacy-preserving outsourcing of clustering. These designs involve multiple entities, which perform clustering over their shared data without disclosing their data to each other. Differently, the dataset in clustering outsourcing is owned by a single entity, who wants to minimize the local computational cost of large-scale clustering. Another line of research related to this work is privacy-preserving KNN search, since both K-means and KNN use the Euclidean distance to measure the similarity of data vectors. An efficient matrix based privacy-preserving KNN search scheme was first proposed by Wong et al. [16], in which they convert the Euclidean distance comparison to scalar product computation. Nevertheless, as demonstrated by Yao et al. [17], ref [16] is vulnerable to the linear analysis attack when the cloud server obtains a set of data objects from the dataset. To overcome this security vulnerability, Yao et al.
[17] present a secure solution by adopting a novel partition-based secure Voronoi diagram design. Unfortunately, their scheme only supports data with no more than two dimensions, and thus becomes impractical for most types of data in the clustering domain. Recently, Su et al. [18] further modified ref [16] by adding a random noise term to each data vector, which enables their scheme to resist the linear analysis attack. However, their design also sacrifices search accuracy to some extent, since the added noise terms are included in the Euclidean distance comparison. Differently, our proposed scheme supports data of any number of dimensions, is resistant to linear analysis attacks as shown in Section IV-D, and does not introduce any accuracy loss. Furthermore, considering privacy-preserving Euclidean distance comparison only, our scheme also significantly reduces the computational cost and storage overhead compared with ref [16], as discussed in Section V-A4. In addition, extending privacy-preserving KNN to support the outsourcing of K-means clustering is not a trivial task. Unlike KNN search, which is a single-round task, K-means clustering is an iterative process and requires the clustering centers to be updated based on all objects in the dataset after each round of clustering. To guarantee the efficiency and privacy of the entire clustering process, our scheme uniquely makes these updates compatible with MapReduce and allows them to be mainly handled by the cloud server over ciphertexts. In particular, the dataset owner only needs to perform a constant number of operations for the update of clustering centers, as evaluated in Section V, which is independent of the size of the large-scale dataset.

B. MapReduce Based K-means Clustering

To efficiently perform K-means clustering over large-scale datasets, the MapReduce framework [23] has been frequently adopted by researchers [19]-[21].
In ref [19], a fast parallel K-means clustering algorithm based on MapReduce is proposed.
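The classic (non-private) MapReduce formulation of one K-means round used by these schemes can be sketched as follows: the mapper emits (index of the closest center, point), the shuffle groups points by center index, and the reducer averages each group. This is a toy single-process Python sketch with illustrative function names, not code from ref [19]:

```python
from collections import defaultdict

def kmeans_round_mapreduce(points, centers):
    """One K-means round expressed as map / shuffle / reduce (toy, single-process)."""
    # map: emit (index of the closest center, point)
    def mapper(p):
        dists = [sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers]
        return dists.index(min(dists)), p

    # shuffle: group emitted points by center index
    groups = defaultdict(list)
    for p in points:
        k, _ = mapper(p)
        groups[k].append(p)

    # reduce: the new center is the mean of the points assigned to it
    new_centers = list(centers)
    for k, pts in groups.items():
        dim = len(pts[0])
        new_centers[k] = tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))
    return new_centers

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_round_mapreduce(pts, [(0.0, 0.0), (10.0, 10.0)]))
# → [(0.0, 0.5), (10.0, 10.5)]
```

In a real Hadoop job the mapper and reducer run on different workers, and a combiner typically pre-aggregates partial sums and counts per center to cut shuffle traffic.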

Later on, ref [20] parallelized the initialization phase of an existing efficient K-means algorithm and implemented it using MapReduce. After that, the performance of MapReduce based K-means was further optimized for large-scale datasets by ref [21]. Similarly, MapReduce is adopted in ref [36] for efficient earth mover's distance similarity joins over large-scale datasets. Nevertheless, all these designs focus only on improving computational performance over large-scale datasets, and none of them takes privacy protection into consideration.

VII. CONCLUSION

In this work, we proposed a privacy-preserving MapReduce based K-means clustering scheme for cloud computing. Thanks to our lightweight encryption design based on the LWE hard problem, our scheme achieves clustering speed and accuracy comparable to K-means clustering without privacy protection. To support large-scale datasets, we securely integrated the MapReduce framework into our design, making it extremely suitable for parallelized processing in cloud computing environments. In addition, the privacy-preserving Euclidean distance comparison component proposed in our design can also be used as an independent tool for distance based applications. We provide thorough analysis to show the security and efficiency of our scheme. Our prototype implementation over 5 million data objects demonstrates that our scheme is efficient, scalable, and accurate for K-means clustering over large-scale datasets.

REFERENCES

[1] European Network and Information Security Agency. Cloud computing security risk assessment.
[2] Darcy A. Davis, Nitesh V. Chawla, Nicholas Blumm, Nicholas Christakis, and Albert-László Barabási. Predicting individual disease risk based on medical history. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, Napa Valley, California, USA, 2008.
[3] U.S. Dept. of Health & Human Services.
Standards for privacy of individually identifiable health information, final rule, 45 CFR, pt. 160 and 164.
[4] Jaideep Vaidya and Chris Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 206-215, New York, NY, USA, 2003. ACM.
[5] Geetha Jagannathan and Rebecca N. Wright. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, New York, NY, USA, 2005. ACM.
[6] Paul Bunn and Rafail Ostrovsky. Secure two-party k-means clustering. In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS '07, New York, NY, USA, 2007. ACM.
[7] Mahir Can Doganay, Thomas B. Pedersen, Yücel Saygin, Erkay Savaş, and Albert Levi. Distributed privacy preserving k-means clustering with additive secret sharing. In Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, PAIS '08, pages 3-11, New York, NY, USA, 2008. ACM.
[8] Jun Sakuma and Shigenobu Kobayashi. Large-scale k-means clustering with user-centric privacy-preservation. Knowledge and Information Systems, 25(2):253-279, 2009.
[9] Xun Yi and Yanchun Zhang. Equally contributory privacy-preserving k-means clustering over vertically partitioned data. Inf. Syst., 38(1):97-107, March 2013.
[10] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. SIGMOD Rec., 29(2):439-450, May 2000.
[11] Stanley R. M. Oliveira and Osmar R. Zaïane. Privacy preserving clustering by data transformation. In Brazilian Symposium on Databases, SBBD, Manaus, Amazonas, Brazil, 2003.
[12] Kun Liu, Chris Giannella, and Hillol Kargupta. An attacker's view of distance preserving maps for privacy preserving data mining.
In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD '06, Berlin, Heidelberg, 2006. Springer-Verlag.
[13] H. Kargupta, S. Datta, Q. Wang, and Krishnamoorthy Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, November 2003.
[14] Dongxi Liu, Elisa Bertino, and Xun Yi. Privacy of outsourced k-means clustering. In Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, ASIA CCS '14, New York, NY, USA, 2014. ACM.
[15] Yongge Wang. Notes on two fully homomorphic encryption schemes without bootstrapping. Cryptology ePrint Archive, Report 2015/519, 2015.
[16] Wai Kit Wong, David Wai-lok Cheung, Ben Kao, and Nikos Mamoulis. Secure kNN computation on encrypted databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 139-152, New York, NY, USA, 2009. ACM.
[17] B. Yao, F. Li, and X. Xiao. Secure nearest neighbor revisited. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 733-744, April 2013.
[18] Sen Su, Yiping Teng, Xiang Cheng, Yulong Wang, and Guoliang Li. Privacy-preserving top-k spatial keyword queries over outsourced database. In Proceedings of the 20th International Conference on Database Systems for Advanced Applications, DASFAA '15, 2015.
[19] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In Proceedings of the 1st International Conference on Cloud Computing, CloudCom '09, pages 674-679, Berlin, Heidelberg, 2009. Springer-Verlag.
[20] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proc. VLDB Endow., 5(7):622-633, March 2012.
[21] Xiaoli Cui, Pingfei Zhu, Xin Yang, Keqiu Li, and Changqing Ji. Optimized big data k-means clustering using MapReduce. J. Supercomput., 70(3), December 2014.
[22] Zvika Brakerski, Craig Gentry, and Shai Halevi. Packed ciphertexts in LWE-based homomorphic encryption. In 16th International Conference on Practice and Theory in Public-Key Cryptography (PKC), pages 1-13, February 2013.
[23] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, January 2008.
[24] Sabrina De Capitani di Vimercati, Sara Foresti, Sushil Jajodia, Stefano Paraboschi, and Pierangela Samarati. Over-encryption: management of access control evolution on outsourced data. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, 2007. VLDB Endowment.
[25] Jiawei Yuan and Shucheng Yu. Privacy preserving back-propagation neural network learning made practical with cloud computing. IEEE Transactions on Parallel and Distributed Systems, 25(1):212-221, 2014.
[26] Ning Cao, Zhenyu Yang, Cong Wang, Kui Ren, and Wenjing Lou. Privacy-preserving query over encrypted graph-structured data in cloud computing. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, 2011.
[27] Jiawei Yuan and Shucheng Yu. Efficient privacy-preserving biometric identification in cloud computing. In 2013 Proceedings IEEE INFOCOM, Turin, Italy, April 2013.
[28] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition, Chapter 7. Morgan Kaufmann, 2006.
[29] Margareta Ackerman, Shai Ben-David, Simina Brânzei, and David Loker. Weighted clustering. 2012.
[30] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1-28:22, January 2013.
[31] Microsoft Azure cloud.
[32] Apache Hadoop.
[33] Mikio L. Braun. jblas library.
[34] Xiaowei Xu, Jochen Jäger, and Hans-Peter Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Min. Knowl. Discov., 3(3):263-290, September 1999.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI /TCC, IEEE.

[35] Mikhail J. Atallah and Keith B. Frikken. Securely outsourcing linear algebra computations. In Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security, ASIACCS '10, pages 48-59, New York, NY, USA, 2010. ACM.
[36] J. Huang, R. Zhang, R. Buyya, and J. Chen. Melody-Join: Efficient earth mover's distance similarity joins using MapReduce. In 2014 IEEE 30th International Conference on Data Engineering, March 2014.

Jiawei Yuan (S'11-M'15) has been an assistant professor of Computer Science in the Dept. of ECSSE at Embry-Riddle Aeronautical University since 2015. He received his Ph.D. in 2015 from the University of Arkansas at Little Rock, and a B.S. in 2011 from the University of Electronic Science and Technology of China. His research interests are in the areas of cyber-security and privacy in cloud computing and big data, eHealth security, and applied cryptography. He is a member of IEEE.

Yifan Tian (S'16) has been a Ph.D. student at Embry-Riddle Aeronautical University since 2016. He received his M.S. in 2015 from Johns Hopkins University, and a B.S. in 2014 from Tongji University, China. His research interests are in the areas of cyber security and network security, with a current focus on secure computation outsourcing. He is a student member of IEEE.


More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

A Review on Privacy Preserving Data Mining Approaches

A Review on Privacy Preserving Data Mining Approaches A Review on Privacy Preserving Data Mining Approaches Anu Thomas Asst.Prof. Computer Science & Engineering Department DJMIT,Mogar,Anand Gujarat Technological University Anu.thomas@djmit.ac.in Jimesh Rana

More information

Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin

Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin Airavat: Security and Privacy for MapReduce Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin Computing in the year 201X 2 Data Illusion of

More information

Efficient Auditable Access Control Systems for Public Shared Cloud Storage

Efficient Auditable Access Control Systems for Public Shared Cloud Storage Efficient Auditable Access Control Systems for Public Shared Cloud Storage Vidya Patil 1, Prof. Varsha R. Dange 2 Student, Department of Computer Science Dhole Patil College of Engineering, Pune, Maharashtra,

More information

Enhancing Cloud Resource Utilisation using Statistical Analysis

Enhancing Cloud Resource Utilisation using Statistical Analysis Institute of Advanced Engineering and Science International Journal of Cloud Computing and Services Science (IJ-CLOSER) Vol.3, No.1, February 2014, pp. 1~25 ISSN: 2089-3337 1 Enhancing Cloud Resource Utilisation

More information

Efficient Private Information Retrieval

Efficient Private Information Retrieval Efficient Private Information Retrieval K O N S T A N T I N O S F. N I K O L O P O U L O S T H E G R A D U A T E C E N T E R, C I T Y U N I V E R S I T Y O F N E W Y O R K K N I K O L O P O U L O S @ G

More information

Secure Multiparty Computation Introduction to Privacy Preserving Distributed Data Mining

Secure Multiparty Computation Introduction to Privacy Preserving Distributed Data Mining CS573 Data Privacy and Security Secure Multiparty Computation Introduction to Privacy Preserving Distributed Data Mining Li Xiong Slides credit: Chris Clifton, Purdue University; Murat Kantarcioglu, UT

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters. Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption

A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption Seiichi Ozawa Center for Mathematical Data Science Graduate School of Engineering Kobe University 2 What is PPDM?

More information

Cloud security is an evolving sub-domain of computer and. Cloud platform utilizes third-party data centers model. An

Cloud security is an evolving sub-domain of computer and. Cloud platform utilizes third-party data centers model. An Abstract Cloud security is an evolving sub-domain of computer and network security. Cloud platform utilizes third-party data centers model. An example of cloud platform as a service (PaaS) is Heroku. In

More information

A Parallel Algorithm for Finding Sub-graph Isomorphism

A Parallel Algorithm for Finding Sub-graph Isomorphism CS420: Parallel Programming, Fall 2008 Final Project A Parallel Algorithm for Finding Sub-graph Isomorphism Ashish Sharma, Santosh Bahir, Sushant Narsale, Unmil Tambe Department of Computer Science, Johns

More information

Privacy Preserving Collaborative Filtering

Privacy Preserving Collaborative Filtering Privacy Preserving Collaborative Filtering Emily Mu, Christopher Shao, Vivek Miglani May 2017 1 Abstract As machine learning and data mining techniques continue to grow in popularity, it has become ever

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

IMPROVING DATA SECURITY USING ATTRIBUTE BASED BROADCAST ENCRYPTION IN CLOUD COMPUTING

IMPROVING DATA SECURITY USING ATTRIBUTE BASED BROADCAST ENCRYPTION IN CLOUD COMPUTING IMPROVING DATA SECURITY USING ATTRIBUTE BASED BROADCAST ENCRYPTION IN CLOUD COMPUTING 1 K.Kamalakannan, 2 Mrs.Hemlathadhevi Abstract -- Personal health record (PHR) is an patient-centric model of health

More information

M 2 R: ENABLING STRONGER PRIVACY IN MAPREDUCE COMPUTATION

M 2 R: ENABLING STRONGER PRIVACY IN MAPREDUCE COMPUTATION 1 M 2 R: ENABLING STRONGER PRIVACY IN MAPREDUCE COMPUTATION TIEN TUAN ANH DINH, PRATEEK SAXENA, EE-CHIEN CHANG, BENG CHIN OOI, AND CHUNWANG ZHANG, NATIONAL UNIVERSITY OF SINGAPORE PRESENTED BY: RAVEEN

More information

Framework Research on Privacy Protection of PHR Owners in Medical Cloud System Based on Aggregation Key Encryption Algorithm

Framework Research on Privacy Protection of PHR Owners in Medical Cloud System Based on Aggregation Key Encryption Algorithm Framework Research on Privacy Protection of PHR Owners in Medical Cloud System Based on Aggregation Key Encryption Algorithm Huiqi Zhao 1,2,3, Yinglong Wang 2,3*, Minglei Shu 2,3 1 Department of Information

More information

Accountability in Privacy-Preserving Data Mining

Accountability in Privacy-Preserving Data Mining PORTIA Privacy, Obligations, and Rights in Technologies of Information Assessment Accountability in Privacy-Preserving Data Mining Rebecca Wright Computer Science Department Stevens Institute of Technology

More information

S. Indirakumari, A. Thilagavathy

S. Indirakumari, A. Thilagavathy International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 2 ISSN : 2456-3307 A Secure Verifiable Storage Deduplication Scheme

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

How to Break and Repair Leighton and Micali s Key Agreement Protocol

How to Break and Repair Leighton and Micali s Key Agreement Protocol How to Break and Repair Leighton and Micali s Key Agreement Protocol Yuliang Zheng Department of Computer Science, University of Wollongong Wollongong, NSW 2522, AUSTRALIA yuliang@cs.uow.edu.au Abstract.

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Privacy-preserving distributed clustering

Privacy-preserving distributed clustering Erkin et al. EURASIP Journal on Information Security 2013, 2013:4 RESEARC Open Access Privacy-preserving distributed clustering Zekeriya Erkin 1*, Thijs Veugen 1,2, Tomas Toft 3 and Reginald L Lagendijk

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Introduction to Cryptography and Security Mechanisms: Unit 5. Public-Key Encryption

Introduction to Cryptography and Security Mechanisms: Unit 5. Public-Key Encryption Introduction to Cryptography and Security Mechanisms: Unit 5 Public-Key Encryption Learning Outcomes Explain the basic principles behind public-key cryptography Recognise the fundamental problems that

More information

ISA 562: Information Security, Theory and Practice. Lecture 1

ISA 562: Information Security, Theory and Practice. Lecture 1 ISA 562: Information Security, Theory and Practice Lecture 1 1 Encryption schemes 1.1 The semantics of an encryption scheme. A symmetric key encryption scheme allows two parties that share a secret key

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017 RESEARCH ARTICLE OPEN ACCESS Optimizing Fully Homomorphic Encryption Algorithm using Greedy Approach in Cloud Computing Kirandeep Kaur [1], Jyotsna Sengupta [2] Department of Computer Science Punjabi University,

More information

CS573 Data Privacy and Security. Cryptographic Primitives and Secure Multiparty Computation. Li Xiong

CS573 Data Privacy and Security. Cryptographic Primitives and Secure Multiparty Computation. Li Xiong CS573 Data Privacy and Security Cryptographic Primitives and Secure Multiparty Computation Li Xiong Outline Cryptographic primitives Symmetric Encryption Public Key Encryption Secure Multiparty Computation

More information

Implementation of 5PM(5ecure Pattern Matching) on Android Platform

Implementation of 5PM(5ecure Pattern Matching) on Android Platform Implementation of 5PM(5ecure Pattern Matching) on Android Platform Overview - Main Objective: Search for a pattern on the server securely The answer at the end -> either YES it is found or NO it is not

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Delegated Access for Hadoop Clusters in the Cloud

Delegated Access for Hadoop Clusters in the Cloud Delegated Access for Hadoop Clusters in the Cloud David Nuñez, Isaac Agudo, and Javier Lopez Network, Information and Computer Security Laboratory (NICS Lab) Universidad de Málaga, Spain Email: dnunez@lcc.uma.es

More information

Volume 6, Issue 1, January 2018 International Journal of Advance Research in Computer Science and Management Studies

Volume 6, Issue 1, January 2018 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) e-isjn: A4372-3114 Impact Factor: 7.327 Volume 6, Issue 1, January 2018 International Journal of Advance Research in Computer Science and Management Studies Research Article /

More information

Privacy-Preserving Using Data mining Technique in Cloud Computing

Privacy-Preserving Using Data mining Technique in Cloud Computing Cis-601 Graduate Seminar Privacy-Preserving Using Data mining Technique in Cloud Computing Submitted by: Rajan Sharma CSU ID: 2659829 Outline Introduction Related work Preliminaries Association Rule Mining

More information

Report: Privacy-Preserving Classification on Deep Neural Network

Report: Privacy-Preserving Classification on Deep Neural Network Report: Privacy-Preserving Classification on Deep Neural Network Janno Veeorg Supervised by Helger Lipmaa and Raul Vicente Zafra May 25, 2017 1 Introduction In this report we consider following task: how

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Privacy Protected Spatial Query Processing

Privacy Protected Spatial Query Processing Privacy Protected Spatial Query Processing Slide 1 Topics Introduction Cloaking-based Solution Transformation-based Solution Private Information Retrieval-based Solution Slide 2 1 Motivation The proliferation

More information

Efficient Algorithm for Frequent Itemset Generation in Big Data

Efficient Algorithm for Frequent Itemset Generation in Big Data Efficient Algorithm for Frequent Itemset Generation in Big Data Anbumalar Smilin V, Siddique Ibrahim S.P, Dr.M.Sivabalakrishnan P.G. Student, Department of Computer Science and Engineering, Kumaraguru

More information

The Pre-Image Problem in Kernel Methods

The Pre-Image Problem in Kernel Methods The Pre-Image Problem in Kernel Methods James Kwok Ivor Tsang Department of Computer Science Hong Kong University of Science and Technology Hong Kong The Pre-Image Problem in Kernel Methods ICML-2003 1

More information

Differential Fault Analysis on the AES Key Schedule

Differential Fault Analysis on the AES Key Schedule ifferential Fault Analysis on the AES Key Schedule Junko TAKAHASHI and Toshinori FUKUNAGA NTT Information Sharing Platform Laboratories, Nippon Telegraph and Telephone Corporation, {takahashi.junko, fukunaga.toshinori}@lab.ntt.co.jp

More information

A Cloud Framework for Big Data Analytics Workflows on Azure

A Cloud Framework for Big Data Analytics Workflows on Azure A Cloud Framework for Big Data Analytics Workflows on Azure Fabrizio MAROZZO a, Domenico TALIA a,b and Paolo TRUNFIO a a DIMES, University of Calabria, Rende (CS), Italy b ICAR-CNR, Rende (CS), Italy Abstract.

More information

Exact Optimized-cost Repair in Multi-hop Distributed Storage Networks

Exact Optimized-cost Repair in Multi-hop Distributed Storage Networks Exact Optimized-cost Repair in Multi-hop Distributed Storage Networks Majid Gerami, Ming Xiao Communication Theory Lab, Royal Institute of Technology, KTH, Sweden, E-mail: {gerami, mingx@kthse arxiv:14012774v1

More information

Structured System Theory

Structured System Theory Appendix C Structured System Theory Linear systems are often studied from an algebraic perspective, based on the rank of certain matrices. While such tests are easy to derive from the mathematical model,

More information

Discovering Dependencies between Virtual Machines Using CPU Utilization. Renuka Apte, Liting Hu, Karsten Schwan, Arpan Ghosh

Discovering Dependencies between Virtual Machines Using CPU Utilization. Renuka Apte, Liting Hu, Karsten Schwan, Arpan Ghosh Look Who s Talking Discovering Dependencies between Virtual Machines Using CPU Utilization Renuka Apte, Liting Hu, Karsten Schwan, Arpan Ghosh Georgia Institute of Technology Talk by Renuka Apte * *Currently

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

Breaking Grain-128 with Dynamic Cube Attacks

Breaking Grain-128 with Dynamic Cube Attacks Breaking Grain-128 with Dynamic Cube Attacks Itai Dinur and Adi Shamir Computer Science department The Weizmann Institute Rehovot 76100, Israel Abstract. We present a new variant of cube attacks called

More information

Clustering and Association using K-Mean over Well-Formed Protected Relational Data

Clustering and Association using K-Mean over Well-Formed Protected Relational Data Clustering and Association using K-Mean over Well-Formed Protected Relational Data Aparna Student M.Tech Computer Science and Engineering Department of Computer Science SRM University, Kattankulathur-603203

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

A Modified Version of Hill Cipher

A Modified Version of Hill Cipher A Modified Version of Hill Cipher A.F.A.Abidin 1, O.Y.Chuan 2 Faculty of Informatics Universiti Sultan Zainal Abidin 21300 Kuala Terengganu, Terengganu, Malaysia. M.R.K.Ariffin 3 Institute for Mathematical

More information

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING Amol Jagtap ME Computer Engineering, AISSMS COE Pune, India Email: 1 amol.jagtap55@gmail.com Abstract Machine learning is a scientific discipline

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Jian Liu, Sara Ramezanian

Jian Liu, Sara Ramezanian CloSer WP2: Privacyenhancing Technologies Jian Liu, Sara Ramezanian Overview Seek to understand how user privacy is impacted by cloud-assisted security services Develop a suite of privacy-enhancing technologies

More information

Secure Multiparty Computation

Secure Multiparty Computation Secure Multiparty Computation Li Xiong CS573 Data Privacy and Security Outline Secure multiparty computation Problem and security definitions Basic cryptographic tools and general constructions Yao s Millionnare

More information

Attribute-based encryption with encryption and decryption outsourcing

Attribute-based encryption with encryption and decryption outsourcing Edith Cowan University Research Online Australian Information Security Management Conference Conferences, Symposia and Campus Events 2014 Attribute-based encryption with encryption and decryption outsourcing

More information

Lectures 6+7: Zero-Leakage Solutions

Lectures 6+7: Zero-Leakage Solutions Lectures 6+7: Zero-Leakage Solutions Contents 1 Overview 1 2 Oblivious RAM 1 3 Oblivious RAM via FHE 2 4 Oblivious RAM via Symmetric Encryption 4 4.1 Setup........................................ 5 4.2

More information

EXECUTION OF PRIVACY - PRESERVING MULTI-KEYWORD POSITIONED SEARCH OVER CLOUD INFORMATION

EXECUTION OF PRIVACY - PRESERVING MULTI-KEYWORD POSITIONED SEARCH OVER CLOUD INFORMATION EXECUTION OF PRIVACY - PRESERVING MULTI-KEYWORD POSITIONED SEARCH OVER CLOUD INFORMATION Sunitha. N 1 and Prof. B. Sakthivel 2 sunithank.dvg@gmail.com and everrock17@gmail.com 1PG Student and 2 Professor

More information

Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist

Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist *ICSI Berkeley Zuse Institut Berlin 4/26/2010 Joos-Hendrik Boese Slide

More information

ISSN Vol.08,Issue.16, October-2016, Pages:

ISSN Vol.08,Issue.16, October-2016, Pages: ISSN 2348 2370 Vol.08,Issue.16, October-2016, Pages:3146-3152 www.ijatir.org Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revocation VEDIRE AJAYANI 1, K. TULASI 2, DR P. SUNITHA

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

A generic and distributed privacy preserving classification method with a worst-case privacy guarantee

A generic and distributed privacy preserving classification method with a worst-case privacy guarantee Distrib Parallel Databases (2014) 32:5 35 DOI 10.1007/s10619-013-7126-6 A generic and distributed privacy preserving classification method with a worst-case privacy guarantee Madhushri Banerjee Zhiyuan

More information

Efficiency Optimisation Of Tor Using Diffie-Hellman Chain

Efficiency Optimisation Of Tor Using Diffie-Hellman Chain Efficiency Optimisation Of Tor Using Diffie-Hellman Chain Kun Peng Institute for Infocomm Research, Singapore dr.kun.peng@gmail.com Abstract Onion routing is the most common anonymous communication channel.

More information

REGULAR GRAPHS OF GIVEN GIRTH. Contents

REGULAR GRAPHS OF GIVEN GIRTH. Contents REGULAR GRAPHS OF GIVEN GIRTH BROOKE ULLERY Contents 1. Introduction This paper gives an introduction to the area of graph theory dealing with properties of regular graphs of given girth. A large portion

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

Integration of information security and network data mining technology in the era of big data

Integration of information security and network data mining technology in the era of big data Acta Technica 62 No. 1A/2017, 157 166 c 2017 Institute of Thermomechanics CAS, v.v.i. Integration of information security and network data mining technology in the era of big data Lu Li 1 Abstract. The

More information

IMAGE DENOISING USING NL-MEANS VIA SMOOTH PATCH ORDERING

IMAGE DENOISING USING NL-MEANS VIA SMOOTH PATCH ORDERING IMAGE DENOISING USING NL-MEANS VIA SMOOTH PATCH ORDERING Idan Ram, Michael Elad and Israel Cohen Department of Electrical Engineering Department of Computer Science Technion - Israel Institute of Technology

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

A Chosen-Plaintext Linear Attack on DES

A Chosen-Plaintext Linear Attack on DES A Chosen-Plaintext Linear Attack on DES Lars R. Knudsen and John Erik Mathiassen Department of Informatics, University of Bergen, N-5020 Bergen, Norway {lars.knudsen,johnm}@ii.uib.no Abstract. In this

More information

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing

More information

Privacy Preserving Data Mining Technique and Their Implementation

Privacy Preserving Data Mining Technique and Their Implementation International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 4, Issue 2, 2017, PP 14-19 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) DOI: http://dx.doi.org/10.20431/2349-4859.0402003

More information

Rapid growth of massive datasets

Rapid growth of massive datasets Overview Rapid growth of massive datasets E.g., Online activity, Science, Sensor networks Data Distributed Clusters are Pervasive Data Distributed Computing Mature Methods for Common Problems e.g., classification,

More information

Cooperative Private Searching in Clouds

Cooperative Private Searching in Clouds Cooperative Private Searching in Clouds Jie Wu Department of Computer and Information Sciences Temple University Road Map Cloud Computing Basics Cloud Computing Security Privacy vs. Performance Proposed

More information

Elements of Cryptography and Computer and Networking Security Computer Science 134 (COMPSCI 134) Fall 2016 Instructor: Karim ElDefrawy

Elements of Cryptography and Computer and Networking Security Computer Science 134 (COMPSCI 134) Fall 2016 Instructor: Karim ElDefrawy Elements of Cryptography and Computer and Networking Security Computer Science 134 (COMPSCI 134) Fall 2016 Instructor: Karim ElDefrawy Homework 2 Due: Friday, 10/28/2016 at 11:55pm PT Will be posted on

More information

SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA

SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA Research Manuscript Title SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA Dr.B.Kalaavathi, SM.Keerthana, N.Renugadevi Professor, Assistant professor, PGScholar Department of

More information

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER Hardware Sizing Using Amazon EC2 A QlikView Scalability Center Technical White Paper June 2013 qlikview.com Table of Contents Executive Summary 3 A Challenge

More information