Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset


Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset

Jiawei Yuan, Member, IEEE, Yifan Tian, Student Member, IEEE

Abstract - Clustering techniques have been widely adopted in many real-world data analysis applications, such as customer behavior analysis, targeted marketing, and digital forensics. With the explosion of data in today's big data era, a major trend for handling clustering over large-scale datasets is outsourcing it to public cloud platforms. This is because cloud computing offers not only reliable services with performance guarantees, but also savings on in-house IT infrastructures. However, as datasets used for clustering may contain sensitive information, e.g., patient health information, commercial data, and behavioral data, directly outsourcing them to public cloud servers inevitably raises privacy concerns. In this paper, we propose a practical privacy-preserving K-means clustering scheme that can be efficiently outsourced to cloud servers. Our scheme allows cloud servers to perform clustering directly over encrypted datasets, while achieving comparable computational complexity and accuracy to clustering over unencrypted ones. We also investigate the secure integration of MapReduce into our scheme, which makes our scheme extremely suitable for the cloud computing environment. Thorough security analysis and numerical analysis demonstrate the performance of our scheme in terms of security and efficiency. Experimental evaluation over a 5 million object dataset further validates the practical performance of our scheme.

Index Terms - Privacy-preserving, K-means Clustering, Cloud Computing

I. INTRODUCTION

CLUSTERING is a major task of exploratory data mining and statistical data analysis, which has been ubiquitously adopted in many domains, including healthcare, social networks, image analysis, and pattern recognition.
Meanwhile, the rapid growth of big data involved in today's data mining and analysis also introduces challenges for clustering in terms of volume, variety, and velocity. To efficiently manage large-scale datasets and support clustering over them, public cloud infrastructure plays the major role for both performance and economic considerations. Nevertheless, using public cloud services inevitably introduces privacy concerns. This is because not only are many data involved in data mining applications sensitive by nature, such as personal health information, localization data, and financial data, but also the public cloud is an open environment operated by external third parties [1]. For example, a promising trend for predicting an individual's disease risk is clustering over existing patients' health records [2], which contain sensitive patient information according to the Health Insurance Portability and Accountability Act (HIPAA) Policy [3]. Therefore, appropriate privacy protection mechanisms must be in place when outsourcing sensitive datasets to the public cloud for clustering. The problem of privacy-preserving K-means clustering has been investigated under the multi-party secure computation model [4]-[9], in which the owners of distributed datasets interact for clustering without disclosing their own datasets to each other. In the multi-party setting, each party has a collection of data and wishes to collaborate with others in a privacy-preserving manner to improve the clustering accuracy. Differently, the dataset in clustering outsourcing is typically owned by a single entity, who aims at minimizing the local computation by delegating the clustering task to a third-party cloud server.
In addition, existing multi-party designs always rely on powerful but expensive cryptographic primitives (e.g., secure circuit evaluation, homomorphic encryption, and oblivious transfer) to achieve collaborative secure computation among multiple parties, and are inefficient for large-scale datasets. Thus, these multi-party designs are not practical for privacy-preserving outsourcing of clustering. Another line of research that targets efficient privacy-preserving clustering uses distance-preserving data perturbation or data transformation to encrypt datasets [10], [11]. Nevertheless, utilizing data perturbation and data transformation for privacy-preserving clustering may not achieve sufficient privacy and accuracy guarantees [12], [13]. For example, adversaries who obtain a few unencrypted data records in the dataset will be able to recover the rest of the records protected by data transformation [12]. Recently, the outsourcing of K-means clustering was studied in ref [14] by utilizing homomorphic encryption and an order-preserving index. However, the homomorphic encryption utilized in [14] is not secure, as pointed out in ref [15]. Moreover, due to the cost of the relatively expensive homomorphic encryption, ref [14] is efficient only for small datasets, e.g., fewer than 50,000 data objects. Another possible candidate for achieving privacy-preserving K-means clustering is to extend existing privacy-preserving K-nearest neighbors (KNN) search schemes [16]-[18]. Unfortunately, these privacy-preserving KNN search schemes are limited by their vulnerability to linear analysis attacks [16], support for at most two-dimensional data [17], or accuracy loss [18]. In addition, KNN is a single-round search task, but K-means clustering is an iterative process that requires the update of clustering centers based on the entire dataset after each round of clustering.
Considering the efficient support of large-scale datasets, these update processes also need to be outsourced to the cloud server in a privacy-preserving manner. Besides privacy protection, there are two other major factors in the outsourcing of K-means clustering: Clustering Efficiency and Clustering Accuracy. Specifically, a practical privacy-preserving outsourcing of K-means clustering shall be easily parallelized, which is important in the cloud computing environment for performance guarantees on large-scale datasets. Meanwhile, the computational cost of the dataset owner shall be minimized, i.e., the owner is only responsible for the system

setup as well as lightweight interactions with cloud servers. Although a number of MapReduce based K-means clustering schemes have been proposed to handle large-scale datasets in parallel [19]-[21], none of them considers privacy protection for the outsourced dataset. Moreover, the privacy protection offered in an outsourced K-means design shall have only slight (ideally no) influence on the clustering accuracy. This is because accuracy is the key factor determining the quality of a clustering algorithm. To the best of our knowledge, there is no existing privacy-preserving MapReduce based K-means outsourcing design that achieves comparable efficiency and accuracy to clustering over unprotected datasets. In this work, we propose a practical privacy-preserving K-means clustering scheme for large-scale datasets, which can be efficiently outsourced to public cloud servers. Our proposed scheme simultaneously meets the privacy, efficiency, and accuracy requirements discussed above. In particular, we propose a novel encryption scheme based on the Learning with Error (LWE) hard problem [22], which achieves privacy-preserving similarity measurement of data objects directly over ciphertexts. Based on our encryption scheme, we further construct the whole K-means clustering process in a privacy-preserving manner, in which cloud servers only have access to encrypted datasets and perform all operations without any decryption. Moreover, we uniquely incorporate MapReduce [23] into our scheme with privacy protection, and thus significantly improve the clustering performance in the cloud computing environment. We provide thorough analysis of our scheme in terms of security and efficiency. We also implemented a prototype of our scheme on the Microsoft Azure cloud. Our extensive evaluation results over 5 million objects show that our privacy-preserving clustering is efficient, scalable, and accurate.
Specifically, compared with the K-means clustering over unencrypted datasets, our scheme achieves the same accuracy as well as comparable computational performance and scalability. The rest of this paper is organized as follows: In Section II, we explain our system model and threat model. Section III describes the background of K-means clustering and the MapReduce framework. We provide the detailed construction of our scheme and its security analysis in Section IV. We evaluate the performance of our scheme in Section V, which is followed by a review of related work in Section VI. Section VII concludes this work.

II. MODELS

A. System Model

In our design, we consider two major entities as shown in Fig. 1: a Dataset Owner and a Cloud Server. The owner has a collection of data objects, which will be outsourced to the cloud server for clustering after encryption. The cloud server performs the K-means clustering directly over the encrypted dataset without any decryption. During the clustering, the cloud server interacts with the owner for a small amount of encrypted intermediate inputs/outputs. The clustering is finished when the clustering results do not change any more, or a predefined number of iterations is reached.

Fig. 1. System Architecture. The owner outsources the encrypted dataset; the cloud server performs single rounds of privacy-preserving clustering; intermediate results and updated encrypted clustering centers are exchanged until the final clustering results are produced.

B. Threat Model

In this work, the cloud server is considered honest-but-curious [24], i.e., the cloud server will honestly follow the designed protocol but try to disclose as much of the content of the dataset as possible. This assumption is consistent with existing works on privacy-preserving outsourcing in cloud computing [14], [25]-[27]. Based on the information available to the cloud server, we consider the following threat models in terms of the privacy protection of data.
Ciphertext Only Model: The cloud server only has access to all encrypted data objects in the dataset, all encrypted clustering centers, and all intermediate outputs generated by the cloud server.

Known Background Model: In this stronger threat model, the cloud server has all the information available in the Ciphertext Only Model. In addition, the cloud server may have some background information about the dataset (e.g., what the topic of the dataset is), and may obtain a small number of data objects in the dataset. We also consider that the cloud server is not able to obtain the clustering centers from background information, since they are generated based on all data objects on the fly during the clustering process. We assume the owner will not be compromised by adversaries. In our scheme, we should prevent the cloud server as well as outside adversaries from learning any data object or clustering center outsourced by the owner.

III. BACKGROUND AND TECHNICAL PRELIMINARIES

A. K-means Clustering

The K-means clustering algorithm aims to allocate a set of data objects into k disjoint clusters, each of which has a center. Each data object is assigned to the cluster whose center has the shortest distance to the object. Data objects and centers can be denoted as multi-dimensional vectors, and their distances can be measured using the square of the Euclidean distance. For example, the square of the distance between two m-dimensional vectors D_1 = [d_11, d_12, ..., d_1m] and D_2 = [d_21, d_22, ..., d_2m] can be calculated as Dist(D_1, D_2) = Σ_{j=1}^m (d_1j − d_2j)². As shown in Algorithm 1, K-means clustering is an iterative process. The algorithm selects k initial cluster centers, and all data objects are allocated to the cluster whose center has the shortest distance to them. After a round of clustering, the centers of the clusters are updated. Particularly, the new center of

Algorithm 1: K-means Clustering
Input: k: number of clusters; max: a predefined number of iterations; n data objects D_i = (d_i1, ..., d_im), 1 ≤ i ≤ n
Output: k clusters
begin
    Select k initial cluster centers C_x, 1 ≤ x ≤ k
    while max > 0 do
        1. Assign each data object D_i to the cluster center with minimum distance Dist(D_i, C_x) to it.
        2. Update each C_x to the average value of the D_i assigned to cluster x.
    Output the k reallocated clusters.

a cluster is generated by averaging each element over all data object vectors in the same cluster. For example, if D_x, D_y, D_z are assigned to the same cluster, the new center is calculated as (D_x + D_y + D_z)/3. This clustering and center update process is conducted iteratively until the data objects in each cluster do not change any more, or a predefined number of iterations is reached. For more details about K-means clustering, please refer to ref [28].

B. Weighted K-means Clustering

A popular extension of the original K-means clustering is the weighted K-means clustering [29], in which every data element is associated with a real-valued weight. This is because different elements in an object can have different levels of importance. In weighted K-means clustering, a weight vector W = [w_1, w_2, ..., w_m] is created for the dataset. Instead of directly using the Euclidean distance for the clustering measurement, the distance between a data object vector D = [d_1, d_2, ..., d_m] and a clustering center C = [c_1, c_2, ..., c_m] is calculated as WDist(D, C, W) = Σ_{j=1}^m w_j (d_j − c_j)². Other operations in weighted K-means clustering remain the same as in the original K-means clustering. For more details about weighted K-means clustering, please refer to ref [29].

C. MapReduce Framework

MapReduce is a programming framework for processing large-scale datasets in a distributed manner. As shown in Fig. 2, to process a task with massive amounts of data, MapReduce divides the task into two phases: map and reduce. These two phases are expressed with map and reduce functions, which take <key,value> pairs as their input and output data format. In a cluster, the nodes responsible for the map and reduce functions are called mappers and reducers respectively. In a MapReduce task, the framework splits input datasets into data chunks, which are processed by independent mappers in parallel. Each map function processes data and generates intermediate outputs as <key,value> pairs. These intermediate outputs are forwarded to reducers after the shuffle. According to the key space of the <key,value> pairs in the intermediate outputs, each reducer is assigned a partition of the pairs. In MapReduce, intermediate <key,value> outputs with the same key are sent to the same reducer. After that, reducers sort and group all intermediate outputs in parallel to generate the final result. More details about MapReduce are introduced in ref [23].

Fig. 2. MapReduce Framework

IV. CONSTRUCTION OF PRIVACY-PRESERVING MAPREDUCE BASED K-MEANS CLUSTERING

A. Scheme Overview

Our scheme consists of three stages as shown in Fig. 3: 1) System Setup and Data Encryption; 2) Single Round MapReduce Based Privacy-preserving Clustering; 3) Privacy-preserving Clustering Center Update. In Stage 1, the owner first sets up the system by selecting parameters for K-means and MapReduce. The owner then generates encryption keys for the system, and encrypts the dataset for clustering. In Stage 2, the cloud server performs a round of clustering and allocates encrypted objects to their closest clustering centers. After that, the cloud server returns a small amount of encrypted information back to the owner as the intermediate outputs. In Stage 3, the owner updates the clustering centers based on the information from the cloud server and his/her own secret keys. These new centers are sent to the cloud server in encrypted format for the next round of clustering. Stage 2 and Stage 3 are iteratively executed until the clustering result does not change any more or the predefined number of iterations is reached.

Fig. 3. Scheme Overview. Stage 1 (owner): 1. Parameter Selection; 2. System Key Generation; 3. Data Encryption. Stage 2 (cloud server): Single Round MapReduce Based Privacy-preserving Clustering. Stage 3 (owner): 1. Decrypt Aggregated Ciphertexts; 2. Update and Re-encrypt Clustering Centers. The encrypted dataset and encrypted initial clustering centers, the aggregated ciphertexts for each cluster, and the updated encrypted clustering centers are exchanged iteratively until the clustering is finished.

We now give the detailed construction of each stage in our scheme. We summarize the important notations used in our construction in Table I, and define two mathematical operations as below.

TABLE I: NOTATION
n — Total number of data objects in the dataset
m — Total number of elements in a data object
K — Total number of clusters
D_i, D'_i — Extended data object vectors
C_k — Extended clustering centers
e_i, e_k — Random noise vectors
W — Weight vector
M, Q — Random invertible matrices
M^{-1}, Q^{-1} — Inverses of M and Q
I — Identity matrix
E(·) — Encryption
SUM_k — Aggregated ciphertexts of data objects
List_k — Clustering result list

Definition IV.1. For a ∈ R, define ⌊a⌉ to be the nearest integer to a, and ⌊a⌉_q to be the nearest integer to a with modulus q.

Definition IV.2. For a vector D (or a matrix M), define max(D) (or max(M)) to be the maximum absolute value of its elements.

B. Detailed Construction

1) Stage 1 - System Setup and Data Encryption: In our system, we consider a dataset with n data objects, which needs to be clustered into K clusters. Each data object contains m elements, and the elements of all objects are scaled to integers with the same scale factor. We denote each scaled data object as {d_i1, d_i2, ..., d_im} ∈ Z_p. Given a data object, the owner first extends it to two 2m-dimensional vectors as

D_i = [r_i·d_i1, r_i·d_i2, ..., r_i·d_im, r_i, α_i1, α_i2, ..., α_i(m−1)]
D'_i = [d_i1, d_i2, ..., d_im, r_i, α_i1, α_i2, ..., α_i(m−1)]

where r_i, α_i1, α_i2, ..., α_i(m−1) ∈ Z_p are random numbers selected by the owner for each data object, and r_i is positive. D_i will be used for the privacy-preserving clustering, and D'_i will be used for the privacy-preserving update of clustering centers. The owner also selects K initial clustering centers and extends them to 2m-dimensional vectors C_k, 1 ≤ k ≤ K, as

C_k = [2c_k1, 2c_k2, ..., 2c_km, −Σ_{j=1}^m c_kj², β_1, β_2, ..., β_(m−1)]

where β_1, β_2, ..., β_(m−1) ∈ Z_p are random numbers selected by the owner and regenerated for each round of clustering. Note that there are different ways of selecting the initial centers [30], and our design is independent of how the initial centers are selected.
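The effect of this extension can be checked numerically. The sketch below is a plaintext reconstruction of the extended-vector trick under our reading of the scheme; the exact signs of the extended entries (2·c_j and −Σ c_j²) are an assumption inferred from the distance-comparison equations of Stage 2, not a verbatim copy of the paper's notation. A single dot product of the two extended vectors yields a blinded quantity that encodes the squared Euclidean distance.

```python
# Plaintext sketch of the vector extension: one dot product encodes the
# squared Euclidean distance, blinded by r and the alpha/beta terms.
# The signs of the extended entries are a reconstruction (assumption).
import random

rng = random.Random(7)
m = 4
d = [rng.randint(0, 10) for _ in range(m)]           # scaled data object
c = [rng.randint(0, 10) for _ in range(m)]           # clustering center
r = rng.randint(1, 5)                                # positive per-object blind
alpha = [rng.randint(-10, 10) for _ in range(m - 1)] # per-object blinds
beta = [rng.randint(-10, 10) for _ in range(m - 1)]  # per-round blinds

D_ext = [r * x for x in d] + [r] + alpha                      # extended D_i
C_ext = [2 * x for x in c] + [-sum(x * x for x in c)] + beta  # extended C_k

dot = sum(x * y for x, y in zip(D_ext, C_ext))
dist = sum((x - y) ** 2 for x, y in zip(d, c))       # true Dist(C, D)
blind = sum(a * b for a, b in zip(alpha, beta))
# dot = r * (sum_j d_j^2 - Dist) + blind: order-preserving but value-hiding.
assert dot == r * (sum(x * x for x in d) - dist) + blind
```

Because the β terms are shared by all centers within one round, they cancel when two such dot products for the same object are subtracted, leaving only r·(Dist_b − Dist_a), whose sign is meaningful since r > 0.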
Key Generation: The key generation of our scheme involves the selection of two random 2m × 2m invertible matrices M and Q. We also use M^{-1} and Q^{-1} to denote the inverses of M and Q respectively, where M·M^{-1} = I, Q·Q^{-1} = I, and I is a 2m × 2m identity matrix. SK = {M, Q, M^{-1}, Q^{-1}} is set as the secret key for the system, and is only known to the owner.

Data Encryption: Given the extended data vectors D_i and D'_i of a data object, the owner encrypts them as

E(D_i) = (Γ·D_i + e_i)·M      (1)
E(D'_i) = (Γ·D'_i + e'_i)·Q      (2)

Here, Γ ∈ Z_q, q ≫ p, Γ ≫ max(e_i), and e_i ∈ Z_q^{2m} is a random integer noise vector generated for each D_i. Then, for each extended clustering center C_k, the owner encrypts it as

E(C_k) = M^{-1}·(Γ·C_kᵀ + e_kᵀ)      (3)

where C_kᵀ and e_kᵀ are the column vectors of C_k and e_k respectively, and e_k ∈ Z_q^{2m} is a random integer noise vector generated for each clustering center C_k. Considering the support of MapReduce, the ciphertexts of each data object are organized as a key-value pair <i, E(D_i)||E(D'_i)>, where the index i of the data object is used as the key and the concatenation of E(D_i) and E(D'_i) is set as the value. All n key-value pairs <i, E(D_i)||E(D'_i)>, 1 ≤ i ≤ n, the K encrypted clustering centers {E(C_k)}, 1 ≤ k ≤ K, and the public parameter Γ are outsourced to the cloud server.

2) Stage 2 - Single Round MapReduce Based Privacy-preserving Clustering: As described in Section III-A, the purpose of clustering for a data object is to find the clustering center that has the minimum Euclidean distance to it. As different data objects are independent of each other within a single round of clustering, we set the clustering of one object as the MapReduce job in our scheme. Now, the first task is to achieve Euclidean distance comparison directly over encrypted data objects and clustering centers.
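The encryption of Eq. 1 and its owner-side inversion can be illustrated with a toy round trip. This is a sketch under simplifying assumptions: M is built as a product of unit-triangular integer matrices (determinant 1) so that its inverse is also an integer matrix, the modulus-q reduction is omitted, and all parameter sizes are illustrative rather than the paper's instantiation.

```python
# Toy round trip of E(D) = (Gamma*D + e) * M and its inversion via M^-1.
# M unimodular and the omitted modulus are demo simplifications (assumptions).
import random

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def tri_inv(T):
    # Exact integer inverse of a unit-triangular matrix via the finite
    # Neumann series: (I + S)^-1 = I - S + S^2 - ..., S nilpotent.
    n = len(T)
    I = [[int(i == j) for j in range(n)] for i in range(n)]
    S = [[T[i][j] - I[i][j] for j in range(n)] for i in range(n)]
    inv, P, sign = [row[:] for row in I], I, 1
    for _ in range(n - 1):
        P, sign = matmul(P, S), -sign
        inv = [[inv[i][j] + sign * P[i][j] for j in range(n)] for i in range(n)]
    return inv

rng = random.Random(1)
n = 8                                   # 2m for m = 4
L = [[1 if i == j else (rng.randint(-2, 2) if i > j else 0) for j in range(n)] for i in range(n)]
U = [[1 if i == j else (rng.randint(-2, 2) if i < j else 0) for j in range(n)] for i in range(n)]
M = matmul(L, U)
M_inv = matmul(tri_inv(U), tri_inv(L))  # (LU)^-1 = U^-1 L^-1, all integers

Gamma = 10 ** 6                         # Gamma >> max |noise|
D_ext = [rng.randint(-50, 50) for _ in range(n)]
e = [rng.randint(-3, 3) for _ in range(n)]
ct = matmul([[Gamma * x + y for x, y in zip(D_ext, e)]], M)[0]   # E(D)

# Owner side: multiply by M^-1, then round away the sub-Gamma noise.
raw = matmul([ct], M_inv)[0]
recovered = [(a + Gamma // 2) // Gamma for a in raw]
assert recovered == D_ext
```

The rounding step is exactly the ⌊·⌉ operation of Definition IV.1: since the noise is much smaller than Γ, dividing by Γ and taking the nearest integer strips it off.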
Given an encrypted object E(D_i) and any two encrypted clustering centers E(C_a), E(C_b), the cloud server first computes

Comp_ia = ⌊ E(D_i)·E(C_a) / Γ² ⌉_q
        = ⌊ D_i·C_aᵀ + (e_i·C_aᵀ + D_i·e_aᵀ + e_i·e_aᵀ/Γ) / Γ ⌉_q
        = 2·r_i·Σ_{j=1}^m d_ij·c_aj − r_i·Σ_{j=1}^m c_aj² + Σ_{j=1}^{m−1} α_ij·β_j
        = r_i·(Σ_{j=1}^m d_ij² − Dist{C_a, D_i}) + Σ_{j=1}^{m−1} α_ij·β_j      (4)

and similarly

Comp_ib = r_i·(Σ_{j=1}^m d_ij² − Dist{C_b, D_i}) + Σ_{j=1}^{m−1} α_ij·β_j

The correctness of the above equations is guaranteed by the distributive property of matrix multiplication and the fact that Γ ≫ p, Γ ≫ max(e_i), and Γ ≫ max(e_k), so the noise terms vanish under the rounding operation. Based on Comp_ia and Comp_ib, the cloud can easily output the difference between Dist{C_a, D_i} and Dist{C_b, D_i} as

Comp_ia − Comp_ib = r_i·(Dist{C_b, D_i} − Dist{C_a, D_i})      (5)
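The whole comparison pipeline can be exercised end to end with toy parameters. The sketch below is an illustration, not the paper's exact instantiation: the invertible matrix is unimodular so that integer arithmetic stays exact, the modulus-q reduction is omitted, and the signs in the extended vectors are our reconstruction. It encrypts one object and two centers, computes both Comp values per Eq. 4, and checks that they order the centers by true distance as Eq. 5 promises.

```python
# End-to-end toy check: larger Comp corresponds to smaller true distance.
# Unimodular M, omitted modulus, and toy sizes are demo assumptions.
import random

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def tri_inv(T):
    # Exact inverse of a unit-triangular matrix: (I + S)^-1 = I - S + S^2 - ...
    n = len(T)
    I = [[int(i == j) for j in range(n)] for i in range(n)]
    S = [[T[i][j] - I[i][j] for j in range(n)] for i in range(n)]
    inv, P, sign = [row[:] for row in I], I, 1
    for _ in range(n - 1):
        P, sign = matmul(P, S), -sign
        inv = [[inv[i][j] + sign * P[i][j] for j in range(n)] for i in range(n)]
    return inv

rng = random.Random(3)
m, Gamma = 4, 10 ** 6
n2 = 2 * m
L = [[1 if i == j else (rng.randint(-2, 2) if i > j else 0) for j in range(n2)] for i in range(n2)]
U = [[1 if i == j else (rng.randint(-2, 2) if i < j else 0) for j in range(n2)] for i in range(n2)]
M = matmul(L, U)
M_inv = matmul(tri_inv(U), tri_inv(L))

def enc_data(v):      # E(D) = (Gamma*D + e) * M, cf. Eq. 1
    e = [rng.randint(-3, 3) for _ in v]
    return matmul([[Gamma * x + y for x, y in zip(v, e)]], M)[0]

def enc_center(v):    # E(C) = M^-1 * (Gamma*C^T + e^T), cf. Eq. 3
    e = [rng.randint(-3, 3) for _ in v]
    return [row[0] for row in matmul(M_inv, [[Gamma * x + y] for x, y in zip(v, e)])]

def comp(ct_d, ct_c):  # Comp = round(E(D).E(C) / Gamma^2), cf. Eq. 4
    s = sum(x * y for x, y in zip(ct_d, ct_c))
    return (s + Gamma * Gamma // 2) // (Gamma * Gamma)

d = [rng.randint(0, 10) for _ in range(m)]
r = rng.randint(1, 5)
alpha = [rng.randint(-10, 10) for _ in range(m - 1)]
beta = [rng.randint(-10, 10) for _ in range(m - 1)]
c_a = [rng.randint(0, 10) for _ in range(m)]
c_b = [rng.randint(0, 10) for _ in range(m)]

ct_d = enc_data([r * x for x in d] + [r] + alpha)
ct_a = enc_center([2 * x for x in c_a] + [-sum(x * x for x in c_a)] + beta)
ct_b = enc_center([2 * x for x in c_b] + [-sum(x * x for x in c_b)] + beta)
comp_a, comp_b = comp(ct_d, ct_a), comp(ct_d, ct_b)

dist = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v))
print((comp_a > comp_b) == (dist(d, c_a) < dist(d, c_b)))  # True
```

Note that the cloud only ever sees ct_d, ct_a, ct_b, and the Comp values, which are blinded by r and the α·β terms; the true distances never appear.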

As r_i is positive, it is clear that the output of the encrypted distance comparison in Eq. 5 is consistent with the comparison using the exact distances Dist{C_a, D_i} and Dist{C_b, D_i}. If the sign of Eq. 5 is positive, clustering center C_a has the smaller distance to the data object; otherwise, clustering center C_b has the smaller distance.

To further process privacy-preserving K-means clustering with MapReduce, the cloud server splits and distributes the n encrypted data pairs <i, E(D_i)||E(D'_i)>, 1 ≤ i ≤ n, to all mappers. The cloud server also sends all K encrypted clustering centers {E(C_k)}, 1 ≤ k ≤ K, to each mapper. Afterwards, each mapper initializes K 2m-dimensional vectors {SUM_k}, 1 ≤ k ≤ K, in which all elements are 0, i.e., [0, 0, ..., 0]. These vectors are used to aggregate the encrypted data objects allocated to the same center, and will be utilized for the update of clustering centers in Stage 3. The map function in our privacy-preserving K-means clustering is shown in Fig. 4. Taking an encrypted data key-value pair <i, E(D_i)||E(D'_i)> and all encrypted clustering centers {E(C_k)}, 1 ≤ k ≤ K, as inputs, a mapper iteratively invokes our privacy-preserving distance comparison described above to figure out the closest center for the data object. The intermediate outputs of our map function include two parts: 1) a key-value pair <k, i> with the index k of the closest center as the key and the index i of the data object as the value; 2) the updated SUM_k for the k-th center. Once a mapper finishes all jobs assigned to it, it organizes its final SUM_k values as key-value pairs <k, SUM_k>, 1 ≤ k ≤ K. Finally, all outputs are sent to the reducers.

Map Process of Privacy-preserving K-means
Input: K encrypted centers {E(C_k)}, 1 ≤ k ≤ K; an encrypted data pair <i, E(D_i)||E(D'_i)>; aggregation of encrypted vectors {SUM_k}, 1 ≤ k ≤ K
Output: key-value pair <index, i> and updated SUM_index, where index is the index of the clustering center assigned to data object D_i
1. index = 1;
2. minCandidate = Comp_i1; // Computed as in Eq. 4.
3. For (k = 2; k < K+1; k++) {
       If (Comp_ik − minCandidate > 0) { // Compared as in Eq. 5.
           minCandidate = Comp_ik; index = k;
       }
   }
4. Output an <index, i> pair, and update SUM_index = SUM_index + E(D'_i);
Fig. 4. Map Process of Privacy-preserving K-means

The reduce function is presented in Fig. 5. On receiving the outputs from mappers, reducers first add the indexes of the data objects allocated to the same cluster to the corresponding result list List_k, 1 ≤ k ≤ K. Reducers also aggregate the partially aggregated SUM_k values from each mapper into {SUM_k}, 1 ≤ k ≤ K, whose final values are SUM_k = Σ_{i∈List_k} E(D'_i).

Reduce Process of Privacy-preserving K-means
Input: <k, i> pairs for all data objects; partially aggregated <k, SUM'_k>, 1 ≤ k ≤ K, from each mapper
Output: K classified clusters <k, List_k> and final aggregated <k, SUM_k>, 1 ≤ k ≤ K, for all n data objects
1. While (<k, i> pairs.hasNext()) { Add i to List_k; }
2. While (<k, SUM'_k> pairs.hasNext()) { SUM_k = SUM_k + SUM'_k; }
3. Output <k, List_k> and SUM_k, 1 ≤ k ≤ K;
Fig. 5. Reduce Process of Privacy-preserving K-means

Based on the output of the reducers, the cloud server checks whether the predefined number of clustering iterations is reached, or whether all result lists List_k, 1 ≤ k ≤ K, are the same as in the previous round of clustering. If so, the clustering is finished and the cloud server sends List_k, 1 ≤ k ≤ K, back to the dataset owner as the clustering result; otherwise, the cloud server sends {SUM_k}, 1 ≤ k ≤ K, back to the owner for the update of the clustering centers. To this end, a single round of privacy-preserving MapReduce based K-means clustering is finished.

3) Stage 3 - Privacy-preserving Clustering Center Update: After a single round of clustering, the clustering centers need to be updated as described in Section III-A. Particularly, the m elements of a new clustering center are calculated as the mean values of the corresponding elements over the data objects currently allocated to the cluster, i.e.,
c_kj = (1/|List_k|)·Σ_{i∈List_k} d_ij, where 1 ≤ j ≤ m and |List_k| is the total number of data objects in the k-th cluster. To efficiently achieve this in a privacy-preserving manner, our scheme first utilizes the MapReduce design in Stage 2 to generate the aggregated ciphertexts of the data objects allocated to the same cluster. Note that only K aggregated ciphertexts {SUM_k}, 1 ≤ k ≤ K, need to be retrieved by the owner, where K is the total number of clusters and each ciphertext is a 2m-dimensional vector. Thus, the communication overhead for the interaction after each round of clustering is lightweight and independent of the size of the dataset. With {SUM_k}, 1 ≤ k ≤ K, the owner decrypts them using the secret key Q^{-1} and the public parameter Γ as

C̃_k = ⌊ SUM_k·Q^{-1} / Γ ⌉_q = Σ_{i∈List_k} D'_i      (6)

As shown in Eq. 6, the decryption of an aggregated ciphertext outputs the aggregation of the corresponding data object vectors, according to the associative and distributive properties of matrix multiplication. Given C̃_k for the k-th cluster, the owner generates the new center C_k as C̃_k/|List_k|. During the update, the owner only keeps the first m elements of each C̃_k, since all the remaining elements are extended values or random numbers as described in the data encryption process of Stage 1. After the K new clustering centers C_k are generated, the owner extends them to 2m-dimensional vectors and encrypts them using the encryption process presented in Stage 1.
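A plaintext mock of the overall round structure may help tie Stages 2 and 3 together: the map side plays the role of Fig. 4 (assign each object, accumulate per-cluster sums), the reduce side plays Fig. 5 (merge lists and partial sums), and the update step takes the element-wise means as above. Encryption is elided, the MapReduce runtime is simulated with plain Python dicts, and all names and data values are illustrative.

```python
# Plaintext mock of one clustering round in the scheme's MapReduce shape.
from collections import defaultdict

def map_chunk(chunk, centers):
    # chunk: list of (i, D_i) pairs handled by one mapper (Fig. 4's role).
    pairs, sums = [], defaultdict(lambda: [0] * len(centers[0]))
    for i, obj in chunk:
        dists = [sum((x - y) ** 2 for x, y in zip(obj, c)) for c in centers]
        k = min(range(len(centers)), key=dists.__getitem__)
        pairs.append((k, i))                      # intermediate <k, i> output
        sums[k] = [s + x for s, x in zip(sums[k], obj)]
    return pairs, sums

def reduce_all(mapper_outputs):
    # Merge per-mapper lists and partial sums (Fig. 5's role).
    lists, sums = defaultdict(list), {}
    for pairs, partial in mapper_outputs:
        for k, i in pairs:
            lists[k].append(i)                    # List_k
        for k, v in partial.items():              # SUM_k = sum of partial sums
            sums[k] = [a + b for a, b in zip(sums.get(k, [0] * len(v)), v)]
    return lists, sums

def update_centers(lists, sums, old_centers):
    # Stage 3: new center = element-wise mean over the cluster's objects.
    new = [list(c) for c in old_centers]
    for k, members in lists.items():
        new[k] = [s / len(members) for s in sums[k]]
    return new

data = [(0, [1, 1]), (1, [3, 3]), (2, [8, 9]), (3, [10, 9])]
centers = [[0, 0], [10, 10]]
chunks = [data[:2], data[2:]]                     # two mappers
lists, sums = reduce_all([map_chunk(ch, centers) for ch in chunks])
centers = update_centers(lists, sums, centers)
print(dict(lists), centers)  # {0: [0, 1], 1: [2, 3]} [[2.0, 2.0], [9.0, 9.0]]
```

In the actual scheme the sums accumulated by the mappers are the ciphertexts E(D'_i), and the division by |List_k| happens only on the owner's side after the decryption of Eq. 6.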

All new encrypted centers are sent to the cloud server for the next round of clustering as described in Stage 2. Stage 2 and Stage 3 are iteratively executed until the clustering is finished. To this end, all operations required in a K-means clustering are supported in a privacy-preserving manner in our construction.

C. Extension to Weighted K-means Clustering

As presented in Section III-B, weighted K-means clustering is similar to the original K-means clustering, with the only difference being the distance computation. In weighted K-means clustering, the weighted distance is calculated as WDist(D, C, W) = Σ_{j=1}^m w_j·(d_j − c_j)², where W = [w_1, w_2, ..., w_m] is the weight vector for the dataset. Thus, to support weighted K-means clustering in a privacy-preserving manner, our design should enable privacy-preserving weighted distance comparison. In particular, we only need to make the following change to Stage 1 of our design introduced in Section IV-B: the weight values are involved in the K extended clustering center vectors as

C_k = [2w_1·c_k1, ..., 2w_m·c_km, −Σ_{j=1}^m w_j·c_kj², β_1, ..., β_(m−1)]

These extended center vectors are encrypted in the same way as in our design for the original K-means. We now show that this change makes our design support privacy-preserving weighted distance comparison, and further leads to privacy-preserving MapReduce based weighted K-means clustering. Specifically, given the ciphertexts of a data object E(D_i), we compute Comp_ia and Comp_ib for clustering centers C_a and C_b as
Specifically, given ciphertexts of a data object E( D i ), we compute Comp ia and Comp ib for clustering centers C a and C b as Comp ia = E(D i ) E(C a ) = Γ q r i d ij w j c aj r i Comp ib = E(D i ) E(C b ) = Γ q r i d ij w j c bj r i (7) m 1 w j c aj + α ij β j m 1 w j c bj + α ij β j After that, the weighted distance comparison can be conducted as Comp ia Comp ib = r i (w j c bj w j c aj + w j (c aj c bj ) d ij ) = r i w j (d ij + c bj d ij c bj (d ij + c aj d ij c aj )) = r i (W Dist( D i, C b, W ) W Dist( D i, C a, W )) In addition, since our extension does not make any change to the processing and encryption of the dataset, the later clustering center update is the same as that in our design for original K-means clustering. To this end, the privacypreserving weighted K-means clustering based on MapReduce can be supported. D. Security Analysis In this section, we show the security of our design under the Ciphertext Only Model and Known Background Model as described in Section II-B. We first prove the security of encryption scheme for all data objects and clustering centers based on the hardness assumption of the Learning With Error (LWE) problem [], which guarantees polynomial-time adversaries are not able to recover the owner s data directly from their ciphertexts. Definition IV.3. Learning with Error (LWE) Problem Given polynomially many samples of ( a i Z m q, b i Z q ) with b i = D a T i + γ i where the error term γ i Z q is drawn from some probability distribution, it is computational infeasible to recover the vector D with non-negligible probability. Theorem IV.4. If the LWE problem is hard, then it is computable infeasible for a polynomial-time adversary to recover D i from its ciphertexts E( D i ) and E( D i ), Ck from its ciphertext E( C k ) in our proposed scheme. Proof. 
In our encryption, each extended D_i is encrypted as

E(D_i) = (Γ·D_i + e_i)·M      (8)

Since D_i and D'_i are encrypted in the same manner, we use D_i in our proof for simplicity of exposition. In E(D_i), as D_i and e_i are 2m-dimensional vectors, their multiplication with the 2m × 2m matrix M can be considered as 2m dot products of 2m-dimensional vectors as follows:

E(D_i)(1) = Γ·D_i·M(1) + e_i·M(1)
...
E(D_i)(2m) = Γ·D_i·M(2m) + e_i·M(2m)

where, for 1 ≤ j ≤ 2m, E(D_i)(j) is the j-th element of E(D_i) and M(j) is the j-th column of M. By denoting Γ·M(j) as M'(j) and e_i·M(j) as e'_ij, we have 2m samples (M'(j), E(D_i)(j)) with

E(D_i)(j) = D_i·M'(j) + e'_ij, 1 ≤ j ≤ 2m

Therefore, recovering D_i from the ciphertext E(D_i) (respectively, D'_i from E(D'_i)) becomes the LWE problem presented in Definition IV.3, which is considered hard and computationally infeasible. It is notable that, since M in our scheme is the secret key and is not available to the adversary, M'(j) is actually also not available to the adversary. This fact makes recovering D_i in our design even more difficult than the LWE problem. With regard to any C_k and its ciphertext E(C_k), we can similarly convert its encryption into 2m dot products as

E(C_k)(j) = M'^{-1}(j)·C_kᵀ + e'_kj, 1 ≤ j ≤ 2m

where M'^{-1}(j) = Γ·M^{-1}(j), M^{-1}(j) is the j-th row of M^{-1}, and e'_kj = M^{-1}(j)·e_kᵀ. Thus, recovering C_k from E(C_k) also becomes the LWE problem, given the 2m samples (M'^{-1}(j), E(C_k)(j)), 1 ≤ j ≤ 2m. In addition, M'^{-1}(j) is also not available to the adversary, since M^{-1} is part of the secret key and only known to the dataset owner. To this end, if the LWE problem is hard, recovering data objects and clustering centers from their corresponding ciphertexts is computationally infeasible for a polynomial-time adversary. Theorem IV.4 is proved.

As the cloud server only has access to ciphertexts in the Ciphertext Only Model, our proposed scheme is secure in this threat model according to our proof of Theorem IV.4. We now further analyze the security of our scheme in the Known Background Model.

Known Background Model: In this threat model, besides the ciphertexts available in the Ciphertext Only Model, the cloud server also has access to a small set of data objects obtained from background information and analysis. This differs from the KNN problem, in which all data objects as well as query objects are independent of each other and the cloud server may obtain some query objects from background information without knowing the other data objects. Differently, clustering centers in the K-means clustering setting are generated from all data objects in the dataset on the fly, and are updated after each round of clustering. Thus, we consider that the cloud server cannot obtain clustering centers from background information and analysis. As the security of our encryption is guaranteed under the LWE problem, the cloud server is not able to recover data objects or clustering centers directly from ciphertexts. We now focus on the linear analysis attack introduced by Yao et al. [17].
Instead of recovering data objects and clustering centers directly from their ciphertexts, this kind of attack attempts to recover the data from the Euclidean distance comparison results as shown in Eq.5. Specifically, given the comparison result of one data object and any two clustering centers, the cloud server can construct an equation

Rst_iab = r_i (Dist{C_b, D_i} − Dist{C_a, D_i})

In this equation, there are 3m + 1 unknowns from C_a, C_b, D_i, and r_i. As the cloud has access to a small set of data objects, it can reduce the number of unknowns in an equation Rst_iab to 2m + 1 (from C_a, C_b, and r_i) if D_i is in its known set of data objects. The original idea of the linear analysis attack is: if the cloud can obtain more than 2m data objects, it can construct 2m Rst_iab equations for the 2m unknowns in C_a and C_b, and solve them to recover C_a and C_b. However, such an attack cannot work in our design, because we embed a different random number r_i for each data object D_i. With this design, each additional equation Rst_iab constructed from a newly obtained D_i also introduces a new unknown r_i, and thus brings no contribution toward recovering C_a and C_b. Therefore, when the cloud obtains 2m data objects, it can only construct 2m equations for solving 4m unknowns from C_a, C_b, and the 2m random numbers r_i, which are unsolvable using linear analysis. To this end, our scheme prevents the cloud server from learning data objects as well as clustering centers in the Known Background Model.

V. EVALUATION

A. Numerical Evaluation

In this section, we numerically analyze our proposed scheme in terms of computational cost, communication overhead, and storage overhead. We also compare the cost of our scheme with the original K-means algorithm, and summarize the results in Table II. For expression simplicity, we use DOT_m to denote one m-dimensional dot product operation in the rest of this paper.
In particular, given two m-dimensional vectors A = [a_1, a_2, ..., a_m] and B = [b_1, b_2, ..., b_m], a DOT_m operation for them is A · B^T = Σ_{j=1}^{m} a_j b_j. We ignore single addition and single modular operations in our evaluation, since their costs are negligible compared to the DOT_m operation.

1) Computational Cost: In our scheme, the dataset owner is responsible for key generation, dataset encryption, and clustering center updates. The key generation process only involves the one-time selection of two random 2m × 2m invertible matrices. To encrypt a data object with m elements, the owner needs to perform 4m DOT_m operations as shown in Eq.1 and Eq.2. Similarly, the encryption cost for a clustering center is 2m DOT_m operations as shown in Eq.3. Therefore, given an n-object dataset and K clusters, the owner needs (4mn + 2mK) DOT_m operations for the one-time encryption. For each round of clustering center update, the owner first needs 2mK DOT_m operations to decrypt the intermediate outputs from the cloud server as shown in Eq.6. Then, another 2mK DOT_m operations are needed to re-encrypt the K updated clustering centers. Therefore, the total computational cost on the owner for a round of clustering is 4mK DOT_m operations. With regard to the cloud server, in each round of clustering it needs K DOT_m operations as shown in Eq.7 to allocate a data object to the closest center. Thus, to process n objects into K clusters in a single round of clustering, the computational cost on the cloud server is nK DOT_m operations. In the original K-means clustering as presented in Section III-A, the cloud server needs to compute K squares of Euclidean distance to allocate a data object in each round of clustering. Since the square of the Euclidean distance between two m-dimensional vectors contains the same number of addition and multiplication operations as a DOT_m operation, we represent each square of Euclidean distance as a DOT_m operation.
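The reduction of squared-Euclidean-distance comparison to dot products, which underlies the extended-vector constructions discussed in this paper, can be sketched as follows. This is a generic numpy illustration, not the paper's exact Eq.5; the extension layout is a common textbook choice:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
D = rng.random(m)                    # a data object
Ca, Cb = rng.random(m), rng.random(m)  # two clustering centers

# ||C - D||^2 = ||C||^2 - 2 C.D + ||D||^2. The ||D||^2 term is shared by both
# distances, so distance *comparison* reduces to dot products with extended
# vectors: append ||C||^2 to each center and a constant 1 to the object.
Ca_ext = np.append(-2 * Ca, Ca @ Ca)
Cb_ext = np.append(-2 * Cb, Cb @ Cb)
D_ext = np.append(D, 1.0)

# equals Dist(Cb, D) - Dist(Ca, D): the ||D||^2 terms cancel in the difference
cmp_via_dots = (Cb_ext - Ca_ext) @ D_ext
cmp_direct = np.sum((Cb - D) ** 2) - np.sum((Ca - D) ** 2)
assert np.isclose(cmp_via_dots, cmp_direct)
```

Because only the sign of the difference matters for allocating an object, a scheme can scale it by a positive per-object random number without changing the comparison result.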
To allocate n objects to K clusters in a round of clustering, the original K-means clustering algorithm thus requires nK DOT_m operations on the cloud server. Therefore, our scheme has the same computational complexity on the cloud server for each round of clustering.

2) Communication Overhead: The communication overhead in our scheme mainly comes from the interaction after each round of clustering. Specifically, if the clustering is not finished, the cloud server needs to send back K aggregated ciphertexts {SUM_k}, 1 ≤ k ≤ K, which are 2m-dimensional vectors. After that, the owner needs to return K ciphertexts {E(C_k)}, 1 ≤ k ≤ K, for the updated clustering centers, which are

TABLE II
NUMERICAL ANALYSIS FOR A SINGLE ROUND OF CLUSTERING

                                          Our Scheme           Original K-means
Computational Complexity (Cloud Server)   nK DOT_m             nK DOT_m
Computational Complexity (Owner)          4mK DOT_m            N/A
Communication Overhead                    4mK vector elements  N/A

*n: the number of data objects in the dataset; m: the number of elements in a data object; K: the number of clustering centers; typically, n >> m, K.

also 2m-dimensional vectors. Thus, the total communication cost in each round of clustering is 4mK vector elements, each of which is 8 Bytes in our implementation. It is notable that the interaction cost for each round of clustering is independent of the size of the dataset, which makes our scheme scalable for large-scale datasets. Although the communication overhead is still linear in the number of clustering centers, this is typically a small number in a practical clustering.

3) Storage Overhead: The storage overhead is introduced by the encryption of the dataset and clustering centers. In our scheme, each data object and clustering center is denoted as an m-dimensional vector, and will be encrypted as two 2m-dimensional vectors. Thus, the total storage cost in our scheme is four times that of the unprotected clustering.

4) Comparison with ref [16]: In this section, we also compare our proposed encryption scheme with that in ref [16], and summarize the results in Table III. As ref [16] is not designed for K-means, our comparison focuses on the major operation in ref [16], i.e., privacy-preserving Euclidean distance comparison. Specifically, given a data object vector with m elements, we assume both our scheme and ref [16] extend it to 2m elements by adding random elements (artificial elements in [16], respectively). To encrypt the extended vector, say D, ref [16] first splits it into two vectors D_A and D_B, and then encrypts them as two 2m-dimensional vectors E(D_A) and E(D_B) using matrix multiplication.
Differently, our scheme directly encrypts D and outputs only one 2m-dimensional vector E(D) as the ciphertext. As all ciphertexts will be outsourced to the cloud server for further processing, it is clear that the storage overhead introduced in ref [16] is twice that of our scheme. With regard to the Euclidean distance comparison process on the cloud, i.e., finding the vector among n encrypted vectors that has the smallest Euclidean distance to an encrypted request vector, ref [16] requires 2n DOT_m operations, while our proposed scheme only requires n DOT_m operations. This is because ref [16] needs to process two ciphertexts for each vector in the dataset, while only one is necessary in our scheme. Similarly, to encrypt a request vector, ref [16] needs two DOT_m operations on the dataset owner, but our scheme only requires one DOT_m operation. Therefore, compared with ref [16], our scheme saves about 50% of the computational cost on the cloud server and the data owner for privacy-preserving Euclidean distance comparison.

TABLE III
NUMERICAL ANALYSIS OF PRIVACY-PRESERVING EUCLIDEAN DISTANCE COMPARISON

                                   Our Scheme           Ref [16]
Computational Cost (Cloud Server)  n DOT_m              2n DOT_m
Computational Cost (Owner)         1 DOT_m              2 DOT_m
Storage Overhead                   2nm vector elements  4nm vector elements

B. Experimental Evaluation

1) Experiment Configuration: To evaluate the performance of our privacy-preserving MapReduce based K-means clustering scheme in terms of efficiency, scalability, and accuracy, we implemented a prototype on the Microsoft Azure cloud [31] using Java 1.7. We deployed a cluster of 6 to 10 nodes with Apache Hadoop 2.6.3 [32] installed. Two nodes are used as head nodes and the other 4 to 8 are used as worker nodes. Each node runs Ubuntu Linux 12.04 with four 2.40GHz CPU cores and 14GB memory. The local machine for the dataset owner is a desktop running OS X with eight 3.3GHz CPU cores and 8GB memory. To support matrix related operations in our scheme, the jblas library 1.2.4 [33] is adopted in the implementation.
The dataset used in our evaluation consists of 5 million simulated data objects. Each object has 10 elements, and can be represented as a 10-dimensional vector. These objects need to be allocated into 10 clusters. To demonstrate that our scheme introduces reasonable computation and communication overhead for its privacy guarantee, we also implemented a non-privacy-preserving MapReduce based K-means under the same configuration. All experimental results represent the mean of 10 trials.

Fig. 6. Dataset Encryption Cost

2) System Setup: As discussed in Section V-A1, the system setup cost mainly comes from the dataset encryption by the dataset owner. As shown in Fig.6, the dataset encryption cost increases

linearly from 1.6s to 6.34s when we change the size of the dataset from 1 million objects to 5 million objects. Note that this is a one-time cost in our scheme, and does not affect the later clustering performance.

3) Efficiency: In our evaluation, we focus on the efficiency of a single round of clustering, because different rounds of clustering have the same computational cost on the owner and the cloud server as shown in Section V-A1. In addition, the number of clustering rounds is mainly determined by the dataset itself and the selection of initial clustering centers, and is independent of the design of our scheme.

Fig. 7. Computational Cost for a Single Round of Clustering

Using 4 worker nodes, Fig.7 shows that the cloud server spends 5.05s to 10.2s to perform a single round of clustering over datasets of 1 million to 5 million objects. Compared with the non-privacy-preserving version, our scheme only brings within 35% more computational cost on the cloud for a single round of clustering. This is consistent with our analysis in Section V-A1, since our scheme achieves the same computational complexity on the cloud server as a non-privacy-preserving design. After each round of clustering, the dataset owner needs to update 10 clustering centers, which only costs 65ms. The total communication overhead after each round of clustering is 3.2KB¹, in which 1.6KB are the aggregated ciphertexts returned by the cloud server and the other 1.6KB are the updated clustering centers uploaded by the owner. It is notable that our communication overhead is independent of the size of the dataset, as shown in Fig. 8. This decent feature also promotes the scalability of our scheme for large-scale datasets. Using a 100Mb bandwidth Internet connection in our experiment, the communication after each round of clustering takes 1.34s.
Therefore, the total cost for a single round of clustering ranges from 6.39s to 11.56s when the size of the dataset varies from 1 million to 5 million objects, as shown in Fig. 7.

4) Scalability: We evaluate the scalability of our scheme with respect to scaleup [34]. Specifically, scaleup is the ability to use m-times larger resources to perform an m-times larger job in the same running time as the original job. Thus, given an original job time T, the scaleup rate is defined as the percentage of the m-times larger job that is finished within time T. In our evaluation, the original job is set as the clustering over 1 million objects using two worker nodes. We then increase the number of worker nodes to 4, 6, and 8, and the number of data objects to 2 million, 3 million, and 4 million, respectively. As demonstrated in Fig.9, the 2-times, 3-times, and 4-times scaleups in our scheme have scaleup rates of 0.9, 0.84, and 0.73, respectively, which is comparable with the scalability of the non-privacy-preserving MapReduce based K-means clustering [19].

¹Each element in a data object is formatted as a Long in Java, which is 8 Bytes.

Fig. 8. Communication Overhead for a Single Round of Clustering

Fig. 9. Scaleup Evaluation

5) Accuracy: Compared with the original K-means clustering algorithm, our scheme does not introduce any accuracy loss if all initial clustering centers are selected in the same way. In particular, the allocation of a data object to the closest center is determined by the Euclidean distance between the object and the center. As discussed in Stage 2 of our scheme, the Euclidean distance comparison result over encrypted data in our scheme is exactly the same as that over unprotected data. Moreover, the update of clustering centers is also the same as in the original K-means clustering. Therefore, our scheme achieves the same accuracy as the original K-means clustering. As a proof of concept, we perform a 100-round clustering over one million data objects. The clustering results of our scheme and the original K-means algorithm are compared by the number of data objects in each cluster after the same number of clustering rounds. As shown in Table IV, our scheme has the same clustering results as the original K-means after 100 rounds of clustering.

TABLE IV
ACCURACY COMPARISON

VI. RELATED WORK

A. Privacy-preserving Clustering

In recent years, a number of schemes have been proposed to outsource clustering tasks in a privacy-preserving manner. In refs [10], [11], distance-preserving data perturbation or data transformation techniques are adopted to protect the privacy of the dataset, while keeping the distance comparison property for clustering purposes. These perturbation based techniques are very efficient and even achieve the same computational cost as the original clustering algorithm. This is because data perturbation based encryption makes the ciphertext the same size as the original data, and uses the same clustering operations as the original clustering algorithm. However, as shown in refs [12], [13], these data perturbation based solutions do not provide a sufficient privacy guarantee. Specifically, once adversaries obtain a small set of unencrypted data objects in the dataset from background analysis, they are able to recover the remaining objects [12]. To provide a strong privacy guarantee, novel cryptographic primitives have been adopted in privacy-preserving clustering outsourcing. In ref [14], a privacy-preserving outsourcing design for K-means clustering is proposed by utilizing homomorphic encryption and order-preserving indexes. Nevertheless, as shown in ref [15], the homomorphic encryption adopted in ref [14] is not secure. Moreover, ref [14] is efficient only for small datasets, e.g., less than 50,000 data objects.
As a comparison, ref [14] requires seconds for a single round of clustering over only 30,000 data objects, while our proposed scheme can finish a single round of clustering over 5 million data objects within 15 seconds, as evaluated in Section V-B3. Another promising candidate for privacy-preserving outsourcing of K-means clustering is the secure outsourcing of general linear algebra computations [35], since all operations required in K-means clustering can be converted to linear algebra computations. However, general secure computation outsourcing mainly focuses on one-round computation, while K-means clustering is an iterative process and needs the ciphertexts to be updated for each round. The problem of privacy-preserving clustering has also been studied in the distributed setting [4]-[9]. These schemes mainly rely on secure multi-party computation techniques, such as secure circuit evaluation, homomorphic encryption, and oblivious transfer. Nevertheless, privacy-preserving distributed clustering has a different purpose from privacy-preserving outsourcing of clustering. These designs involve multiple entities, which perform clustering over their shared data without disclosing their data to each other. Differently, the dataset in clustering outsourcing is owned by a single entity, who wants to minimize the local computational cost of large-scale clustering. Another line of research related to this work is privacy-preserving KNN search, since both K-means and KNN use the Euclidean distance to measure the similarity of data vectors. An efficient matrix based privacy-preserving KNN search scheme was first proposed by Wong et al. [16], in which they convert the Euclidean distance comparison to scalar product computation. Nevertheless, as demonstrated by Yao et al. [17], ref [16] is vulnerable to the linear analysis attack when the cloud server obtains a set of data objects from the dataset. To overcome this security vulnerability, Yao et al.
[17] present a secure solution by adopting a novel partition-based secure Voronoi diagram design. Unfortunately, their scheme only supports data with no more than two dimensions, and thus becomes impractical for most types of data in the clustering domain. Recently, Su et al. [18] further modified ref [16] by adding a random noise term to each data vector, which enables their scheme to resist the linear analysis attack. However, their design also sacrifices search accuracy to some extent, since the added noise terms are included in the Euclidean distance comparison. Differently, our proposed scheme supports data of any number of dimensions, is resistant to linear analysis attacks as shown in Section IV-D, and does not introduce any accuracy loss. Furthermore, considering privacy-preserving Euclidean distance comparison only, our scheme also significantly reduces the computational cost and storage overhead compared with ref [16], as discussed in Section V-A4. In addition, extending privacy-preserving KNN to support the outsourcing of K-means clustering is not a trivial task. Unlike KNN search, which is a single-round task, K-means clustering is an iterative process and requires the clustering centers to be updated based on all objects in the dataset after each round of clustering. To guarantee the efficiency and privacy of the entire clustering process, our scheme uniquely makes these updates compatible with MapReduce and allows them to be mainly handled by the cloud server over ciphertexts. In particular, the dataset owner only needs to perform a constant number of operations for the update of clustering centers, as evaluated in Section V, which is independent of the size of the large-scale dataset.

B. MapReduce Based K-means Clustering

To efficiently perform K-means clustering over large-scale datasets, the MapReduce framework [23] has been frequently adopted by researchers [19]-[21].
In ref [19], a fast parallel K-means clustering algorithm based on MapReduce is proposed.
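The classic (non-private) MapReduce formulation of one K-means round used by these schemes can be sketched as follows: the mapper emits (index of the closest center, point), the shuffle groups points by center index, and the reducer averages each group. This is a toy single-process Python sketch with illustrative function names, not code from ref [19]:

```python
from collections import defaultdict

def kmeans_round_mapreduce(points, centers):
    """One K-means round expressed as map / shuffle / reduce (toy, single-process)."""
    # map: emit (index of the closest center, point)
    def mapper(p):
        dists = [sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers]
        return dists.index(min(dists)), p

    # shuffle: group emitted points by center index
    groups = defaultdict(list)
    for p in points:
        k, _ = mapper(p)
        groups[k].append(p)

    # reduce: the new center is the mean of the points assigned to it
    new_centers = list(centers)
    for k, pts in groups.items():
        dim = len(pts[0])
        new_centers[k] = tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))
    return new_centers

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_round_mapreduce(pts, [(0.0, 0.0), (10.0, 10.0)]))
# → [(0.0, 0.5), (10.0, 10.5)]
```

In a real Hadoop job the mapper and reducer run on different workers, and a combiner typically pre-aggregates partial sums and counts per center to cut shuffle traffic.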

Later on, ref [20] parallelized the initialization phase of an existing efficient K-means algorithm and implemented it using MapReduce. After that, the performance of MapReduce based K-means was further optimized for large-scale datasets by ref [21]. Similarly, MapReduce is adopted in ref [36] for efficient earth mover's distance similarity joins over large-scale datasets. Nevertheless, all these designs focus only on improving computational performance over large-scale datasets, and none of them takes privacy protection into consideration.

VII. CONCLUSION

In this work, we proposed a privacy-preserving MapReduce based K-means clustering scheme for cloud computing. Thanks to our lightweight encryption design based on the LWE hard problem, our scheme achieves clustering speed and accuracy comparable to K-means clustering without privacy protection. To support large-scale datasets, we securely integrated the MapReduce framework into our design, making it extremely suitable for parallelized processing in cloud computing environments. In addition, the privacy-preserving Euclidean distance comparison component proposed in our design can also be used as an independent tool for distance based applications. We provide thorough analysis to show the security and efficiency of our scheme. Our prototype implementation over 5 million data objects demonstrates that our scheme is efficient, scalable, and accurate for K-means clustering over large-scale datasets.

REFERENCES

[1] European Network and Information Security Agency. Cloud computing security risk assessment.
[2] Darcy A. Davis, Nitesh V. Chawla, Nicholas Blumm, Nicholas Christakis, and Albert-László Barabási. Predicting individual disease risk based on medical history. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, Napa Valley, California, USA, 2008.
[3] U.S. Dept. of Health & Human Services.
Standards for privacy of individually identifiable health information, final rule, 45 CFR, pt. 160 and 164.
[4] Jaideep Vaidya and Chris Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 206-215, New York, NY, USA, 2003. ACM.
[5] Geetha Jagannathan and Rebecca N. Wright. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, New York, NY, USA, 2005. ACM.
[6] Paul Bunn and Rafail Ostrovsky. Secure two-party k-means clustering. In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS '07, New York, NY, USA, 2007. ACM.
[7] Mahir Can Doganay, Thomas B. Pedersen, Yücel Saygin, Erkay Savaş, and Albert Levi. Distributed privacy preserving k-means clustering with additive secret sharing. In Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, PAIS '08, pages 3-11, New York, NY, USA, 2008. ACM.
[8] Jun Sakuma and Shigenobu Kobayashi. Large-scale k-means clustering with user-centric privacy-preservation. Knowledge and Information Systems, 25(2):253-279, 2009.
[9] Xun Yi and Yanchun Zhang. Equally contributory privacy-preserving k-means clustering over vertically partitioned data. Inf. Syst., 38(1):97-107, March 2013.
[10] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. SIGMOD Rec., 29(2):439-450, May 2000.
[11] Stanley R. M. Oliveira and Osmar R. Zaïane. Privacy preserving clustering by data transformation. In Brazilian Symposium on Databases, SBBD, Manaus, Amazonas, Brazil, 2003.
[12] Kun Liu, Chris Giannella, and Hillol Kargupta. An attacker's view of distance preserving maps for privacy preserving data mining.
In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD '06, Berlin, Heidelberg, 2006. Springer-Verlag.
[13] H. Kargupta, S. Datta, Q. Wang, and Krishnamoorthy Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, November 2003.
[14] Dongxi Liu, Elisa Bertino, and Xun Yi. Privacy of outsourced k-means clustering. In Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, ASIA CCS '14, New York, NY, USA, 2014. ACM.
[15] Yongge Wang. Notes on two fully homomorphic encryption schemes without bootstrapping. Cryptology ePrint Archive, Report 2015/519, 2015.
[16] Wai Kit Wong, David Wai-lok Cheung, Ben Kao, and Nikos Mamoulis. Secure kNN computation on encrypted databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 139-152, New York, NY, USA, 2009. ACM.
[17] B. Yao, F. Li, and X. Xiao. Secure nearest neighbor revisited. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 733-744, April 2013.
[18] Sen Su, Yiping Teng, Xiang Cheng, Yulong Wang, and Guoliang Li. Privacy-preserving top-k spatial keyword queries over outsourced database. In Proceedings of the 20th International Conference on Database Systems for Advanced Applications, DASFAA '15, 2015.
[19] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In Proceedings of the 1st International Conference on Cloud Computing, CloudCom '09, pages 674-679, Berlin, Heidelberg, 2009. Springer-Verlag.
[20] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proc. VLDB Endow., 5(7):622-633, March 2012.
[21] Xiaoli Cui, Pingfei Zhu, Xin Yang, Keqiu Li, and Changqing Ji. Optimized big data k-means clustering using MapReduce. J. Supercomput., 70(3), December 2014.
[22] Zvika Brakerski, Craig Gentry, and Shai Halevi. Packed ciphertexts in LWE-based homomorphic encryption. In 16th International Conference on Practice and Theory in Public-Key Cryptography (PKC), pages 1-13, February 2013.
[23] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, January 2008.
[24] Sabrina De Capitani di Vimercati, Sara Foresti, Sushil Jajodia, Stefano Paraboschi, and Pierangela Samarati. Over-encryption: management of access control evolution on outsourced data. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, 2007. VLDB Endowment.
[25] Jiawei Yuan and Shucheng Yu. Privacy preserving back-propagation neural network learning made practical with cloud computing. IEEE Transactions on Parallel and Distributed Systems, 25(1):212-221, 2014.
[26] Ning Cao, Zhenyu Yang, Cong Wang, Kui Ren, and Wenjing Lou. Privacy-preserving query over encrypted graph-structured data in cloud computing. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, 2011.
[27] Jiawei Yuan and Shucheng Yu. Efficient privacy-preserving biometric identification in cloud computing. In 2013 Proceedings IEEE INFOCOM, Turin, Italy, April 2013.
[28] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition, Chapter 7. Morgan Kaufmann, 2006.
[29] Margareta Ackerman, Shai Ben-David, Simina Brânzei, and David Loker. Weighted clustering. 2012.
[30] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1-28:22, January 2013.
[31] Microsoft Azure cloud.
[32] Apache Hadoop.
[33] Mikio L. Braun. jblas library.
[34] Xiaowei Xu, Jochen Jäger, and Hans-Peter Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Min. Knowl. Discov., 3(3):263-290, September 1999.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI /TCC, IEEE.

[35] Mikhail J. Atallah and Keith B. Frikken. Securely outsourcing linear algebra computations. In Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security, ASIACCS '10, pages 48-59, New York, NY, USA, 2010. ACM.
[36] J. Huang, R. Zhang, R. Buyya, and J. Chen. Melody-Join: Efficient earth mover's distance similarity joins using MapReduce. In 2014 IEEE 30th International Conference on Data Engineering, March 2014.

Jiawei Yuan (S'11-M'15) has been an assistant professor of Computer Science in the Dept. of ECSSE at Embry-Riddle Aeronautical University since 2015. He received his Ph.D. in 2015 from the University of Arkansas at Little Rock, and a B.S. in 2011 from the University of Electronic Science and Technology of China. His research interests are in the areas of cyber-security and privacy in cloud computing and big data, eHealth security, and applied cryptography. He is a member of IEEE.

Yifan Tian (S'16) has been a Ph.D. student at Embry-Riddle Aeronautical University since 2016. He received his M.S. in 2015 from Johns Hopkins University, and a B.S. in 2014 from Tongji University, China. His research interests are in the areas of cyber security and network security, with a current focus on secure computation outsourcing. He is a student member of IEEE.


More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

A Review on Privacy Preserving Data Mining Approaches

A Review on Privacy Preserving Data Mining Approaches A Review on Privacy Preserving Data Mining Approaches Anu Thomas Asst.Prof. Computer Science & Engineering Department DJMIT,Mogar,Anand Gujarat Technological University Anu.thomas@djmit.ac.in Jimesh Rana

More information

Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin

Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin Airavat: Security and Privacy for MapReduce Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin Computing in the year 201X 2 Data Illusion of

More information

Efficient Auditable Access Control Systems for Public Shared Cloud Storage

Efficient Auditable Access Control Systems for Public Shared Cloud Storage Efficient Auditable Access Control Systems for Public Shared Cloud Storage Vidya Patil 1, Prof. Varsha R. Dange 2 Student, Department of Computer Science Dhole Patil College of Engineering, Pune, Maharashtra,

More information

Enhancing Cloud Resource Utilisation using Statistical Analysis

Enhancing Cloud Resource Utilisation using Statistical Analysis Institute of Advanced Engineering and Science International Journal of Cloud Computing and Services Science (IJ-CLOSER) Vol.3, No.1, February 2014, pp. 1~25 ISSN: 2089-3337 1 Enhancing Cloud Resource Utilisation

More information

Efficient Private Information Retrieval

Efficient Private Information Retrieval Efficient Private Information Retrieval K O N S T A N T I N O S F. N I K O L O P O U L O S T H E G R A D U A T E C E N T E R, C I T Y U N I V E R S I T Y O F N E W Y O R K K N I K O L O P O U L O S @ G

More information

Secure Multiparty Computation Introduction to Privacy Preserving Distributed Data Mining

Secure Multiparty Computation Introduction to Privacy Preserving Distributed Data Mining CS573 Data Privacy and Security Secure Multiparty Computation Introduction to Privacy Preserving Distributed Data Mining Li Xiong Slides credit: Chris Clifton, Purdue University; Murat Kantarcioglu, UT

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters. Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption

A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption Seiichi Ozawa Center for Mathematical Data Science Graduate School of Engineering Kobe University 2 What is PPDM?

More information

Cloud security is an evolving sub-domain of computer and. Cloud platform utilizes third-party data centers model. An

Cloud security is an evolving sub-domain of computer and. Cloud platform utilizes third-party data centers model. An Abstract Cloud security is an evolving sub-domain of computer and network security. Cloud platform utilizes third-party data centers model. An example of cloud platform as a service (PaaS) is Heroku. In

More information

A Parallel Algorithm for Finding Sub-graph Isomorphism

A Parallel Algorithm for Finding Sub-graph Isomorphism CS420: Parallel Programming, Fall 2008 Final Project A Parallel Algorithm for Finding Sub-graph Isomorphism Ashish Sharma, Santosh Bahir, Sushant Narsale, Unmil Tambe Department of Computer Science, Johns

More information

Privacy Preserving Collaborative Filtering

Privacy Preserving Collaborative Filtering Privacy Preserving Collaborative Filtering Emily Mu, Christopher Shao, Vivek Miglani May 2017 1 Abstract As machine learning and data mining techniques continue to grow in popularity, it has become ever

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

IMPROVING DATA SECURITY USING ATTRIBUTE BASED BROADCAST ENCRYPTION IN CLOUD COMPUTING

IMPROVING DATA SECURITY USING ATTRIBUTE BASED BROADCAST ENCRYPTION IN CLOUD COMPUTING IMPROVING DATA SECURITY USING ATTRIBUTE BASED BROADCAST ENCRYPTION IN CLOUD COMPUTING 1 K.Kamalakannan, 2 Mrs.Hemlathadhevi Abstract -- Personal health record (PHR) is an patient-centric model of health

More information

M 2 R: ENABLING STRONGER PRIVACY IN MAPREDUCE COMPUTATION

M 2 R: ENABLING STRONGER PRIVACY IN MAPREDUCE COMPUTATION 1 M 2 R: ENABLING STRONGER PRIVACY IN MAPREDUCE COMPUTATION TIEN TUAN ANH DINH, PRATEEK SAXENA, EE-CHIEN CHANG, BENG CHIN OOI, AND CHUNWANG ZHANG, NATIONAL UNIVERSITY OF SINGAPORE PRESENTED BY: RAVEEN

More information

Framework Research on Privacy Protection of PHR Owners in Medical Cloud System Based on Aggregation Key Encryption Algorithm

Framework Research on Privacy Protection of PHR Owners in Medical Cloud System Based on Aggregation Key Encryption Algorithm Framework Research on Privacy Protection of PHR Owners in Medical Cloud System Based on Aggregation Key Encryption Algorithm Huiqi Zhao 1,2,3, Yinglong Wang 2,3*, Minglei Shu 2,3 1 Department of Information

More information

Accountability in Privacy-Preserving Data Mining

Accountability in Privacy-Preserving Data Mining PORTIA Privacy, Obligations, and Rights in Technologies of Information Assessment Accountability in Privacy-Preserving Data Mining Rebecca Wright Computer Science Department Stevens Institute of Technology

More information

S. Indirakumari, A. Thilagavathy

S. Indirakumari, A. Thilagavathy International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 2 ISSN : 2456-3307 A Secure Verifiable Storage Deduplication Scheme

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

How to Break and Repair Leighton and Micali s Key Agreement Protocol

How to Break and Repair Leighton and Micali s Key Agreement Protocol How to Break and Repair Leighton and Micali s Key Agreement Protocol Yuliang Zheng Department of Computer Science, University of Wollongong Wollongong, NSW 2522, AUSTRALIA yuliang@cs.uow.edu.au Abstract.

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Privacy-preserving distributed clustering

Privacy-preserving distributed clustering Erkin et al. EURASIP Journal on Information Security 2013, 2013:4 RESEARC Open Access Privacy-preserving distributed clustering Zekeriya Erkin 1*, Thijs Veugen 1,2, Tomas Toft 3 and Reginald L Lagendijk

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Introduction to Cryptography and Security Mechanisms: Unit 5. Public-Key Encryption

Introduction to Cryptography and Security Mechanisms: Unit 5. Public-Key Encryption Introduction to Cryptography and Security Mechanisms: Unit 5 Public-Key Encryption Learning Outcomes Explain the basic principles behind public-key cryptography Recognise the fundamental problems that

More information

ISA 562: Information Security, Theory and Practice. Lecture 1

ISA 562: Information Security, Theory and Practice. Lecture 1 ISA 562: Information Security, Theory and Practice Lecture 1 1 Encryption schemes 1.1 The semantics of an encryption scheme. A symmetric key encryption scheme allows two parties that share a secret key

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017 RESEARCH ARTICLE OPEN ACCESS Optimizing Fully Homomorphic Encryption Algorithm using Greedy Approach in Cloud Computing Kirandeep Kaur [1], Jyotsna Sengupta [2] Department of Computer Science Punjabi University,

More information

CS573 Data Privacy and Security. Cryptographic Primitives and Secure Multiparty Computation. Li Xiong

CS573 Data Privacy and Security. Cryptographic Primitives and Secure Multiparty Computation. Li Xiong CS573 Data Privacy and Security Cryptographic Primitives and Secure Multiparty Computation Li Xiong Outline Cryptographic primitives Symmetric Encryption Public Key Encryption Secure Multiparty Computation

More information

Implementation of 5PM(5ecure Pattern Matching) on Android Platform

Implementation of 5PM(5ecure Pattern Matching) on Android Platform Implementation of 5PM(5ecure Pattern Matching) on Android Platform Overview - Main Objective: Search for a pattern on the server securely The answer at the end -> either YES it is found or NO it is not

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Delegated Access for Hadoop Clusters in the Cloud

Delegated Access for Hadoop Clusters in the Cloud Delegated Access for Hadoop Clusters in the Cloud David Nuñez, Isaac Agudo, and Javier Lopez Network, Information and Computer Security Laboratory (NICS Lab) Universidad de Málaga, Spain Email: dnunez@lcc.uma.es

More information

Volume 6, Issue 1, January 2018 International Journal of Advance Research in Computer Science and Management Studies

Volume 6, Issue 1, January 2018 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) e-isjn: A4372-3114 Impact Factor: 7.327 Volume 6, Issue 1, January 2018 International Journal of Advance Research in Computer Science and Management Studies Research Article /

More information

Privacy-Preserving Using Data mining Technique in Cloud Computing

Privacy-Preserving Using Data mining Technique in Cloud Computing Cis-601 Graduate Seminar Privacy-Preserving Using Data mining Technique in Cloud Computing Submitted by: Rajan Sharma CSU ID: 2659829 Outline Introduction Related work Preliminaries Association Rule Mining

More information

Report: Privacy-Preserving Classification on Deep Neural Network

Report: Privacy-Preserving Classification on Deep Neural Network Report: Privacy-Preserving Classification on Deep Neural Network Janno Veeorg Supervised by Helger Lipmaa and Raul Vicente Zafra May 25, 2017 1 Introduction In this report we consider following task: how

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Privacy Protected Spatial Query Processing

Privacy Protected Spatial Query Processing Privacy Protected Spatial Query Processing Slide 1 Topics Introduction Cloaking-based Solution Transformation-based Solution Private Information Retrieval-based Solution Slide 2 1 Motivation The proliferation

More information

Efficient Algorithm for Frequent Itemset Generation in Big Data

Efficient Algorithm for Frequent Itemset Generation in Big Data Efficient Algorithm for Frequent Itemset Generation in Big Data Anbumalar Smilin V, Siddique Ibrahim S.P, Dr.M.Sivabalakrishnan P.G. Student, Department of Computer Science and Engineering, Kumaraguru

More information

The Pre-Image Problem in Kernel Methods

The Pre-Image Problem in Kernel Methods The Pre-Image Problem in Kernel Methods James Kwok Ivor Tsang Department of Computer Science Hong Kong University of Science and Technology Hong Kong The Pre-Image Problem in Kernel Methods ICML-2003 1

More information

Differential Fault Analysis on the AES Key Schedule

Differential Fault Analysis on the AES Key Schedule ifferential Fault Analysis on the AES Key Schedule Junko TAKAHASHI and Toshinori FUKUNAGA NTT Information Sharing Platform Laboratories, Nippon Telegraph and Telephone Corporation, {takahashi.junko, fukunaga.toshinori}@lab.ntt.co.jp

More information

A Cloud Framework for Big Data Analytics Workflows on Azure

A Cloud Framework for Big Data Analytics Workflows on Azure A Cloud Framework for Big Data Analytics Workflows on Azure Fabrizio MAROZZO a, Domenico TALIA a,b and Paolo TRUNFIO a a DIMES, University of Calabria, Rende (CS), Italy b ICAR-CNR, Rende (CS), Italy Abstract.

More information

Exact Optimized-cost Repair in Multi-hop Distributed Storage Networks

Exact Optimized-cost Repair in Multi-hop Distributed Storage Networks Exact Optimized-cost Repair in Multi-hop Distributed Storage Networks Majid Gerami, Ming Xiao Communication Theory Lab, Royal Institute of Technology, KTH, Sweden, E-mail: {gerami, mingx@kthse arxiv:14012774v1

More information

Structured System Theory

Structured System Theory Appendix C Structured System Theory Linear systems are often studied from an algebraic perspective, based on the rank of certain matrices. While such tests are easy to derive from the mathematical model,

More information

Discovering Dependencies between Virtual Machines Using CPU Utilization. Renuka Apte, Liting Hu, Karsten Schwan, Arpan Ghosh

Discovering Dependencies between Virtual Machines Using CPU Utilization. Renuka Apte, Liting Hu, Karsten Schwan, Arpan Ghosh Look Who s Talking Discovering Dependencies between Virtual Machines Using CPU Utilization Renuka Apte, Liting Hu, Karsten Schwan, Arpan Ghosh Georgia Institute of Technology Talk by Renuka Apte * *Currently

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

Breaking Grain-128 with Dynamic Cube Attacks

Breaking Grain-128 with Dynamic Cube Attacks Breaking Grain-128 with Dynamic Cube Attacks Itai Dinur and Adi Shamir Computer Science department The Weizmann Institute Rehovot 76100, Israel Abstract. We present a new variant of cube attacks called

More information

Clustering and Association using K-Mean over Well-Formed Protected Relational Data

Clustering and Association using K-Mean over Well-Formed Protected Relational Data Clustering and Association using K-Mean over Well-Formed Protected Relational Data Aparna Student M.Tech Computer Science and Engineering Department of Computer Science SRM University, Kattankulathur-603203

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

A Modified Version of Hill Cipher

A Modified Version of Hill Cipher A Modified Version of Hill Cipher A.F.A.Abidin 1, O.Y.Chuan 2 Faculty of Informatics Universiti Sultan Zainal Abidin 21300 Kuala Terengganu, Terengganu, Malaysia. M.R.K.Ariffin 3 Institute for Mathematical

More information

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING Amol Jagtap ME Computer Engineering, AISSMS COE Pune, India Email: 1 amol.jagtap55@gmail.com Abstract Machine learning is a scientific discipline

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Jian Liu, Sara Ramezanian

Jian Liu, Sara Ramezanian CloSer WP2: Privacyenhancing Technologies Jian Liu, Sara Ramezanian Overview Seek to understand how user privacy is impacted by cloud-assisted security services Develop a suite of privacy-enhancing technologies

More information

Secure Multiparty Computation

Secure Multiparty Computation Secure Multiparty Computation Li Xiong CS573 Data Privacy and Security Outline Secure multiparty computation Problem and security definitions Basic cryptographic tools and general constructions Yao s Millionnare

More information

Attribute-based encryption with encryption and decryption outsourcing

Attribute-based encryption with encryption and decryption outsourcing Edith Cowan University Research Online Australian Information Security Management Conference Conferences, Symposia and Campus Events 2014 Attribute-based encryption with encryption and decryption outsourcing

More information

Lectures 6+7: Zero-Leakage Solutions

Lectures 6+7: Zero-Leakage Solutions Lectures 6+7: Zero-Leakage Solutions Contents 1 Overview 1 2 Oblivious RAM 1 3 Oblivious RAM via FHE 2 4 Oblivious RAM via Symmetric Encryption 4 4.1 Setup........................................ 5 4.2

More information

EXECUTION OF PRIVACY - PRESERVING MULTI-KEYWORD POSITIONED SEARCH OVER CLOUD INFORMATION

EXECUTION OF PRIVACY - PRESERVING MULTI-KEYWORD POSITIONED SEARCH OVER CLOUD INFORMATION EXECUTION OF PRIVACY - PRESERVING MULTI-KEYWORD POSITIONED SEARCH OVER CLOUD INFORMATION Sunitha. N 1 and Prof. B. Sakthivel 2 sunithank.dvg@gmail.com and everrock17@gmail.com 1PG Student and 2 Professor

More information

Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist

Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist *ICSI Berkeley Zuse Institut Berlin 4/26/2010 Joos-Hendrik Boese Slide

More information

ISSN Vol.08,Issue.16, October-2016, Pages:

ISSN Vol.08,Issue.16, October-2016, Pages: ISSN 2348 2370 Vol.08,Issue.16, October-2016, Pages:3146-3152 www.ijatir.org Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revocation VEDIRE AJAYANI 1, K. TULASI 2, DR P. SUNITHA

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

A generic and distributed privacy preserving classification method with a worst-case privacy guarantee

A generic and distributed privacy preserving classification method with a worst-case privacy guarantee Distrib Parallel Databases (2014) 32:5 35 DOI 10.1007/s10619-013-7126-6 A generic and distributed privacy preserving classification method with a worst-case privacy guarantee Madhushri Banerjee Zhiyuan

More information

Efficiency Optimisation Of Tor Using Diffie-Hellman Chain

Efficiency Optimisation Of Tor Using Diffie-Hellman Chain Efficiency Optimisation Of Tor Using Diffie-Hellman Chain Kun Peng Institute for Infocomm Research, Singapore dr.kun.peng@gmail.com Abstract Onion routing is the most common anonymous communication channel.

More information

REGULAR GRAPHS OF GIVEN GIRTH. Contents

REGULAR GRAPHS OF GIVEN GIRTH. Contents REGULAR GRAPHS OF GIVEN GIRTH BROOKE ULLERY Contents 1. Introduction This paper gives an introduction to the area of graph theory dealing with properties of regular graphs of given girth. A large portion

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

Integration of information security and network data mining technology in the era of big data

Integration of information security and network data mining technology in the era of big data Acta Technica 62 No. 1A/2017, 157 166 c 2017 Institute of Thermomechanics CAS, v.v.i. Integration of information security and network data mining technology in the era of big data Lu Li 1 Abstract. The

More information

IMAGE DENOISING USING NL-MEANS VIA SMOOTH PATCH ORDERING

IMAGE DENOISING USING NL-MEANS VIA SMOOTH PATCH ORDERING IMAGE DENOISING USING NL-MEANS VIA SMOOTH PATCH ORDERING Idan Ram, Michael Elad and Israel Cohen Department of Electrical Engineering Department of Computer Science Technion - Israel Institute of Technology

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

A Chosen-Plaintext Linear Attack on DES

A Chosen-Plaintext Linear Attack on DES A Chosen-Plaintext Linear Attack on DES Lars R. Knudsen and John Erik Mathiassen Department of Informatics, University of Bergen, N-5020 Bergen, Norway {lars.knudsen,johnm}@ii.uib.no Abstract. In this

More information

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing

More information

Privacy Preserving Data Mining Technique and Their Implementation

Privacy Preserving Data Mining Technique and Their Implementation International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 4, Issue 2, 2017, PP 14-19 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) DOI: http://dx.doi.org/10.20431/2349-4859.0402003

More information

Rapid growth of massive datasets

Rapid growth of massive datasets Overview Rapid growth of massive datasets E.g., Online activity, Science, Sensor networks Data Distributed Clusters are Pervasive Data Distributed Computing Mature Methods for Common Problems e.g., classification,

More information

Cooperative Private Searching in Clouds

Cooperative Private Searching in Clouds Cooperative Private Searching in Clouds Jie Wu Department of Computer and Information Sciences Temple University Road Map Cloud Computing Basics Cloud Computing Security Privacy vs. Performance Proposed

More information

Elements of Cryptography and Computer and Networking Security Computer Science 134 (COMPSCI 134) Fall 2016 Instructor: Karim ElDefrawy

Elements of Cryptography and Computer and Networking Security Computer Science 134 (COMPSCI 134) Fall 2016 Instructor: Karim ElDefrawy Elements of Cryptography and Computer and Networking Security Computer Science 134 (COMPSCI 134) Fall 2016 Instructor: Karim ElDefrawy Homework 2 Due: Friday, 10/28/2016 at 11:55pm PT Will be posted on

More information

SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA

SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA Research Manuscript Title SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA Dr.B.Kalaavathi, SM.Keerthana, N.Renugadevi Professor, Assistant professor, PGScholar Department of

More information

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER Hardware Sizing Using Amazon EC2 A QlikView Scalability Center Technical White Paper June 2013 qlikview.com Table of Contents Executive Summary 3 A Challenge

More information