arXiv: v1 [cs.CR] 22 Apr 2015

Differentially Private k-Means Clustering

Dong Su #, Jianneng Cao, Ninghui Li #, Elisa Bertino #, Hongxia Jin

# Department of Computer Science, Purdue University, {su1, ninghui, bertino}@cs.purdue.edu
Institute for Infocomm Research, Singapore, caojn@ir.a-star.edu.sg
Samsung Information Systems of America, hongxia.jin@sisa.samsung.com

ABSTRACT

There are two broad approaches to differentially private data analysis. The interactive approach aims at developing customized differentially private algorithms for various data mining tasks. The non-interactive approach aims at developing differentially private algorithms that output a synopsis of the input dataset, which can then be used to support various data mining tasks. In this paper we study the tradeoff of interactive vs. non-interactive approaches and propose a hybrid approach that combines the two, using k-means clustering as an example. In the hybrid approach to differentially private k-means clustering, one first uses a non-interactive mechanism to publish a synopsis of the input dataset, then applies the standard k-means clustering algorithm to learn cluster centroids, and finally uses an interactive approach to further improve these cluster centroids. We analyze the error behavior of both non-interactive and interactive approaches and use this analysis to decide how to allocate the privacy budget between the non-interactive step and the interactive step. Results from extensive experiments support our analysis and demonstrate the effectiveness of our approach.

1. INTRODUCTION

In recent years, a large and growing body of literature has investigated differentially private data analysis. Broadly, this work can be classified into two approaches. The interactive approach aims at developing customized differentially private algorithms for specific data mining tasks. One identifies the queries that need to be answered for the data mining task, analyzes their sensitivity, and then answers them with appropriate noise added. The non-interactive approach aims at computing, in a differentially private way, a synopsis of the input dataset, which can then be used to generate a synthetic dataset or to directly support various data mining tasks.

An intriguing question is which of the two approaches is better. Given an input dataset D, the desired privacy parameter ε, which we refer to as the privacy budget, and one or more data analysis tasks, should one use the interactive approach or the non-interactive approach? This question is largely open. In general, the non-interactive approach has the advantage that once a synopsis is constructed, many analysis tasks can be conducted on it. In contrast, with the interactive approach one is limited to executing the interactive algorithm just once; any additional access to the dataset would violate differential privacy. Therefore, strictly speaking, a dataset can serve only one analyst, and for only one task. (One could divide the privacy budget among multiple analysts and/or multiple tasks, but then the accuracy of each task will suffer.) On the other hand, because the interactive approach is designed specifically for a particular data mining task, one might expect that, under the same privacy budget, it should produce more accurate results than the non-interactive approach.

In this paper we initiate the study of the tradeoff of interactive vs. non-interactive approaches, using k-means clustering as the example. Clustering analysis plays an essential role in data management tasks. Clustering has also been used as a prime example to illustrate the effectiveness of interactive differentially private data analysis [, 11,,, 1,, 0]. There are three state-of-the-art interactive algorithms.
The first is the differentially private version of the Lloyd algorithm [, ], which we call DPLloyd. The second uses the sample and aggregation framework [] and is implemented in the GUPT system [1]; we call it GkM. The third and most recent one, which we call PGkM, uses PrivGene [0], a framework for differentially private model fitting based on genetic algorithms.

To the best of our knowledge, performing k-means clustering using the non-interactive approach has not been explicitly proposed in the literature. In this paper, we propose to combine the following non-interactive differentially private synopsis algorithm with k-means clustering. The dataset is viewed as a set of points over a d-dimensional domain, which is divided into M equal-size cells, and a noisy count is obtained for each cell. A key decision is the choice of the parameter M. A larger M means lower average counts per cell, so the noisy counts are more likely to be dominated by noise. A smaller M means larger cells, so one has less accurate information about where the points are. We propose a method that sets $M = (N\epsilon/\theta)^{2d/(2+d)}$, where θ is a constant (see Section 4.2.1). This choice is derived by extending the analysis in [], which aims to minimize errors when answering rectangular range queries over 2-dimensional data, to the higher-dimensional case. We call the resulting k-means algorithm EUGkM, where EUG stands for Extended Uniform Grid.

We conducted extensive experimental evaluations of these algorithms on six external datasets and 1 datasets that we synthesized by varying the dimension d from to and the number of clusters k from to. The experimental results are quite interesting. GkM was introduced after DPLloyd and was claimed to have an accuracy advantage over DPLloyd, and PGkM was introduced later and compared with GkM. However, we found that DPLloyd is the best method among these three. In the comparison of DPLloyd and GkM in [1], DPLloyd was run with a much larger number of iterations than necessary, and thus performed poorly. In [0], PGkM was compared only with GkM, and not with DPLloyd. More specifically, we found that GkM is by far the worst among all methods. Through experimental analysis of the sources of the errors, we found that it is possible to dramatically improve the accuracy of GkM by choosing smaller partitions in the sample and aggregation framework. After this improvement, GkM becomes competitive with PGkM. However, DPLloyd, the earliest method, is clearly the best performing of the interactive algorithms. Our analysis also explains why DPLloyd outperforms PGkM. The genetic-programming-style PGkM needs more iterations to converge. When these algorithms are made differentially private, the privacy budget is divided among all iterations, so more iterations mean more noise added in each iteration. Therefore, the more direct DPLloyd outperforms PGkM.

The most intriguing results are those comparing DPLloyd with EUGkM. For most datasets, EUGkM performs much better than DPLloyd. For a few, they perform similarly, and for two datasets DPLloyd outperforms EUGkM. Through further theoretical and empirical analysis, we found that while the performance of both algorithms is greatly affected by the two key parameters k and d, they are affected differently. DPLloyd scales worse when k increases, while EUGkM scales worse when d increases. Again we use analysis to demonstrate why this is the case.

An intriguing question is whether we can further improve DPLloyd. The accuracy of DPLloyd is affected by two key factors: the number of iterations and the choice of initial centroids. In fact, these two are closely related.
If the initially chosen centroids are very good and close to the true centroids, perhaps only one iteration is needed to improve them, and this reduction in the number of iterations means that little noise is added. This leads us to propose a novel hybrid method that combines non-interactive EUGkM with interactive DPLloyd. We first use half the privacy budget to run EUGkM, and then use the centroids output by EUGkM as the initial centroids for one round of DPLloyd. Such a method, however, may not actually outperform EUGkM, especially when the privacy budget ε is small, since one round of DPLloyd may then worsen the centroids. We use our error analysis formulas to determine whether there is sufficient privacy budget for the hybrid approach to outperform EUGkM. We then experimentally validate the effectiveness of the hybrid approach.

The hybrid idea is applicable to general private data analysis tasks that require parameter tuning. In the non-private setting, one typically tunes parameters by building models for several parameter values and selecting the one that offers the best utility. Under differential privacy, this kind of parameter tuning procedure does not work well, since the limited privacy budget might be over-divided by trying many different parameters. Chaudhuri et al. [] proposed a method for private parameter tuning that takes advantage of parallel composition. The idea is to build private models with different parameters on separate subsets of the dataset and evaluate the models on a validation set. The best parameter is chosen via the exponential mechanism, with the quality function defined by the evaluation score. However, this approach does not scale well to a large set of candidate parameters, which can leave each data block with very few points and therefore lead to very inaccurate models. Our proposed hybrid approach offers a better solution. We can first publish private synopses of the input data, on which we try a large set of parameters.
Then, we run the interactive private analysis with the selected parameter on the input dataset to get the final result.

In this paper we advance the state of the art on differentially private data mining in several ways. First, we introduce non-interactive methods for differentially private k-means clustering, which are highly effective and often outperform state-of-the-art interactive methods. Second, we extensively evaluate three interactive methods and one non-interactive method, and analyze their strengths and weaknesses. Third, we develop techniques to analyze the error resulting from both DPLloyd and EUGkM. Finally, we introduce the novel concept of a hybrid approach to differentially private data analysis, which is so far the best approach to k-means clustering. We conjecture that the hybrid approach may prove useful in other analysis tasks as well.

The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we give preliminary information about differential privacy and k-means clustering. In Section 4, we describe the three existing interactive approaches, DPLloyd, GkM, and PGkM, and one non-interactive approach, EUGkM. In Section 5, we show experimental results comparing the interactive and non-interactive approaches, and analyze their strengths and weaknesses. In Section 6, we study the error behavior of DPLloyd and EUGkM, introduce the hybrid approach, and compare it with existing algorithms. We conclude in Section 7.

2. RELATED WORK

The notion of differential privacy was developed in a series of papers [, 1,, 1, ]. Several primitives for answering a single query in a differentially private way have been proposed. Dwork et al. [1] introduced the method of adding Laplacian noise scaled with the sensitivity. McSherry and Talwar [] introduced the more general exponential mechanism. Nissim et al. [] proposed adding noise proportional to local sensitivity.

Blum et al. [] proposed the sublinear query (SuLQ) database model for interactively answering a sublinear number (in the size of the underlying database) of count queries in a differentially private way. The users (e.g., machine learning algorithms) issue queries and get responses with Laplace noise added. They applied the SuLQ framework to k-means clustering and some other machine learning algorithms. McSherry [] built the PINQ (Privacy INtegrated Queries) system, a programming platform that provides several differentially private primitives to enable data analysts to write privacy-preserving applications. These primitives include noisy count, noisy sum, noisy average, and the exponential mechanism. The DPLloyd algorithm, which we compare against in this paper, has been implemented using these primitives. Another programming framework with differential privacy support is Airavat, which makes programs using the MapReduce framework differentially private [].

Nissim et al. [, ] proposed the sample and aggregate framework (SAF), and used k-means clustering as a motivating application for SAF. SAF has been implemented in the GUPT system [1], which is evaluated on k-means clustering. This is the GkM algorithm that we compare with in this paper. Dwork [11] suggested applying a geometrically decreasing privacy budget allocation strategy among the iterations of k-means, whereas we use an increasing sequence. A geometrically decreasing sequence causes later rounds to use increasingly less privacy budget, resulting in higher and higher distortion with each new iteration. Zhang et al. [0] proposed a general private model fitting framework based on genetic algorithms. The PGkM approach in this paper is an instantiation of that framework for k-means clustering.

Interactive methods for other data mining tasks have been proposed. McSherry and Mironov [] adapted algorithms producing recommendations from collective user behavior to satisfy differential privacy. Friedman and Schuster [1] made the ID3 decision tree construction algorithm differentially private. Chaudhuri and Monteleoni [] proposed a differentially private logistic regression algorithm. Zhang et al. [1] introduced the functional mechanism, which perturbs an optimization objective to satisfy differential privacy, and applied it to linear regression and logistic regression. Differentially private frequent itemset mining has been studied in [, ]. The tradeoffs of interactive and non-interactive approaches in these domains are interesting future research topics.

Most non-interactive approaches aim at developing solutions to answer histogram or range queries accurately [1,, 1, ]. Dwork et al. [1] calculate the frequency of values and release their distribution in a differentially private way. With this method, the variance of a query result increases linearly with the query size. To address this issue, Xiao et al. [] proposed a wavelet-based method, by which the variance is polylogarithmic in the query size. Hay et al. [1] organize the count queries in a hierarchy, and improve accuracy by enforcing consistency between the noisy count of a parent node and those of its children. Cormode et al. [] adapted standard spatial indexing techniques, such as the quadtree and kd-tree, to decompose the data space in a differentially private way. Qardaji et al. [] proposed the UG and AG methods for publishing 2-dimensional datasets. Mohammed et al. [] tailored non-interactive data release to the construction of decision trees.

Roth et al. [] studied the problem of releasing synthetic data in a differentially private way for any set of count queries specified in advance. They proposed an ε-differentially private mechanism whose error scales only logarithmically with the number of queries being answered. However, it is not computationally efficient (super-polynomial in the data universe size). Subsequent work includes [1, 0, 1,, 1, 1]. A typical example is the private multiplicative weights mechanism [0], which answers count queries interactively and whose error also scales logarithmically with the number of queries seen so far. Its running time is only linear in the data universe size.

3. BACKGROUND

3.1 Differential Privacy

Informally, differential privacy requires that the output of a data analysis mechanism be approximately the same even if any single tuple in the input database is arbitrarily added or removed.

DEFINITION 1 (ε-DIFFERENTIAL PRIVACY [, 1]). A randomized mechanism A gives ε-differential privacy if for any pair of neighboring datasets D and D′, and any S ∈ Range(A),

$$\Pr[A(D) = S] \le e^{\epsilon} \, \Pr[A(D') = S].$$

In this paper we consider two datasets D and D′ to be neighbors if and only if either D = D′ + t or D′ = D + t, where D + t denotes the dataset resulting from adding the tuple t to the dataset D. We use D ≃ D′ to denote this. This protects the privacy of any single tuple, because adding or removing any single tuple changes the probability distribution of the output by a multiplicative factor bounded by $e^{\epsilon}$.

Differential privacy is composable in the sense that combining multiple mechanisms that satisfy differential privacy for $\epsilon_1, \ldots, \epsilon_m$ results in a mechanism that satisfies ε-differential privacy for $\epsilon = \sum_i \epsilon_i$. Because of this, we refer to ε as the privacy budget of a privacy-preserving data analysis task. When a task involves multiple steps, each step uses a portion of ε so that the sum of these portions is no more than ε.
There are several approaches for designing mechanisms that satisfy ε-differential privacy, including the Laplace mechanism [1] and the exponential mechanism []. The Laplace mechanism computes a function g on the dataset D by adding to g(D) random noise whose magnitude depends on $GS_g$, the global sensitivity or $L_1$ sensitivity of g. Such a mechanism $A_g$ is given below:

$$A_g(D) = g(D) + \mathrm{Lap}\!\left(\frac{GS_g}{\epsilon}\right),$$

where $GS_g = \max_{(D,D'):\, D \simeq D'} |g(D) - g(D')|$ and $\Pr[\mathrm{Lap}(\beta) = x] = \frac{1}{2\beta} e^{-|x|/\beta}$. In the above, Lap(β) denotes a random variable sampled from the Laplace distribution with scale parameter β.

3.2 k-means Clustering Algorithms

The k-means clustering problem is as follows: given a d-dimensional dataset $D = \{x^1, x^2, \ldots, x^N\}$, partition the data points in D into k sets $O = \{O^1, O^2, \ldots, O^k\}$ so that the Normalized Intra-Cluster Variance (NICV) is minimized:

$$\frac{1}{N} \sum_{j=1}^{k} \sum_{x^{\ell} \in O^j} \left\| x^{\ell} - o^j \right\|^2. \tag{1}$$

The standard k-means algorithm is Lloyd's algorithm []. The algorithm starts by selecting k points as the initial choices for the centroids. It then tries to improve these centroid choices iteratively until no improvement can be made. In each iteration, one first partitions the data points into k clusters, with each point assigned to the same cluster as its nearest centroid. Then, one updates each centroid to be the center of the data points in its cluster:

$$\forall i \in [1..d]: \quad o^j_i \leftarrow \frac{\sum_{x^{\ell} \in O^j} x^{\ell}_i}{|O^j|}, \tag{2}$$

where j = 1, 2, ..., k, and $x^{\ell}_i$ and $o^j_i$ are the i-th dimensions of $x^{\ell}$ and $o^j$, respectively. The algorithm continues by alternating between data partition and centroid update until it converges.

4. THE INTERACTIVE AND NON-INTERACTIVE APPROACHES

In this section, we describe interactive approaches and non-interactive approaches to differentially private k-means clustering.

4.1 Interactive Approaches

4.1.1 DPLloyd

Differentially private k-means based on Lloyd's algorithm was first proposed by Blum et al. [] and was later implemented in the PINQ system [], a platform for interactive privacy-preserving data analysis. We call this the DPLloyd approach.
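The background pieces above (the Laplace mechanism, one Lloyd iteration, and the NICV objective) can be sketched in a few lines of NumPy. This is a minimal illustration rather than any of the implementations evaluated in the paper; the function names and the choice to leave an empty cluster's centroid unchanged are assumptions made for the example.

```python
import numpy as np


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Return value + Lap(sensitivity / epsilon); works for scalars and arrays."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))


def squared_distances(points, centroids):
    """Matrix of squared Euclidean distances, shape (N, k)."""
    diff = points[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2)


def lloyd_step(points, centroids):
    """One round of Lloyd's algorithm: partition, then update each centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = squared_distances(points, centroids).argmin(axis=1)
    updated = centroids.copy()
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
            updated[j] = members.mean(axis=0)
    return updated


def nicv(points, centroids):
    """Average squared distance from each point to its nearest centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    return squared_distances(points, centroids).min(axis=1).mean()
```

For example, `laplace_mechanism(len(data), 1.0, epsilon, rng)` would answer a count query, whose sensitivity is 1 under the add-or-remove-one-tuple neighbor relation.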
DPLloyd differs from the standard Lloyd algorithm in the following ways. First, Laplacian noise is added to the iterative update step of the Lloyd algorithm. Second, the number of iterations needs to be fixed in order to decide how much noise needs to be added in each iteration. Each iteration requires computing the total number of points in a cluster and, for each dimension, the sum of the coordinates of the data points in the cluster. Let t be the number of iterations and d the number of dimensions. Then each tuple is involved in answering dt sum queries and t count queries. To bound the sensitivity of each sum query by a small number r, each dimension is normalized to [−r, r]. Thus, the global sensitivity of DPLloyd is (dr + 1)t, and each query is answered by adding Laplacian noise $\mathrm{Lap}\!\left(\frac{(dr+1)t}{\epsilon}\right)$.

Two issues greatly impact the accuracy of DPLloyd. The first is the number of iterations. A large number of iterations causes too much noise to be added. A small number of iterations may be insufficient for the algorithm to converge. In [], the number of iterations is set to be, which seems to work well for many settings. The second is the quality of the initial centroids. A poor choice of initial centroids can result in converging to a local optimum that is far from the global optimum, or in not converging within the given number of iterations. While many methods for choosing the initial points have been developed [], these methods were developed without privacy concerns and need access to the dataset. In [], the initial centroids are chosen uniformly at random from the domain. We have observed empirically that this can perform poorly in some settings, since some randomly chosen initial centroids are close together.

We thus introduce an improved method for choosing initial centroids that is similar to the concept of sphere packing. Given a radius a, we randomly generate k centroids one by one such that each new centroid is at distance at least a from each border of the domain and at distance at least a from any existing centroid. When a randomly chosen point does not satisfy this condition, we generate another point. When we have failed repeatedly, we conclude that the radius a is too large, and try a smaller radius. We use a binary search to find the maximal value of a such that the process of choosing k centroids succeeds. This process is data independent.
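The DPLloyd update and the sphere-packing-style initialization can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PINQ implementation: the noise scale (d·r + 1)·t/ε is the one implied by answering d sum queries (sensitivity r each) and one count query per iteration over t iterations; the rule of skipping an update when the noisy count is below 1, the clipping of centroids to [−r, r], the retry limit `max_tries`, and all function names are hypothetical choices made for the example.

```python
import numpy as np


def try_place(k, d, r, a, rng, max_tries=1000):
    """Try to draw k points in [-r, r]^d that are at distance >= a from every
    border and from each other; return None if max_tries draws do not suffice."""
    if a >= r:
        return None
    chosen = []
    for _ in range(max_tries):
        p = rng.uniform(-r + a, r - a, size=d)
        if all(np.linalg.norm(p - c) >= a for c in chosen):
            chosen.append(p)
            if len(chosen) == k:
                return np.array(chosen)
    return None


def initial_centroids(k, d, r, rng, steps=20):
    """Binary-search the largest radius a for which try_place succeeds.
    Uses no data, so it consumes no privacy budget."""
    lo, hi = 0.0, float(r)
    best = try_place(k, d, r, 0.0, rng)  # a = 0 always succeeds
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        placed = try_place(k, d, r, mid, rng)
        if placed is not None:
            best, lo = placed, mid
        else:
            hi = mid
    return best, lo


def dplloyd(points, init_centroids, epsilon, iterations, r, rng):
    """Lloyd iterations with Laplace noise on every count and coordinate sum."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    d = points.shape[1]
    scale = (d * r + 1) * iterations / epsilon  # Lap((dr + 1) t / eps)
    for _ in range(iterations):
        diff = points[:, None, :] - centroids[None, :, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)
        for j in range(len(centroids)):
            members = points[labels == j]
            noisy_count = len(members) + rng.laplace(0.0, scale)
            noisy_sum = members.sum(axis=0) + rng.laplace(0.0, scale, size=d)
            if noisy_count >= 1.0:  # assumption: otherwise keep the old centroid
                centroids[j] = np.clip(noisy_sum / noisy_count, -r, r)
    return centroids
```

A typical call would be `dplloyd(data, initial_centroids(k, d, r, rng)[0], epsilon, t, r, rng)`, with `data` already normalized to [−r, r] in every dimension.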
4.1.2 GkM

The k-means clustering problem was also used to motivate the sample and aggregate framework (SAF) for satisfying differential privacy, which was developed in [, ] and implemented in the GUPT system [1]. Given a dataset D and a function f, SAF first partitions D into ℓ blocks, then evaluates f on each block, and finally privately aggregates the results from all blocks into a single one. Since any single tuple in D falls in one and only one block, adding one tuple can affect at most one block's result, limiting the sensitivity of the aggregation step. Thus one can add less noise in the final step to satisfy differential privacy.

As far as we know, GUPT [1] is the only implementation of SAF. The authors of [1] implemented k-means clustering and used it to illustrate the effectiveness of GUPT. We call this algorithm GkM. Given a dataset D, it first partitions D into ℓ blocks $D_1, D_2, \ldots, D_{\ell}$. Then, for each block $D_b$ (1 ≤ b ≤ ℓ), it calculates its k centroids $o^{b,1}, o^{b,2}, \ldots, o^{b,k}$. Finally, it averages the centroids calculated from all blocks and adds noise. Specifically, the i-th dimension of the j-th aggregated centroid is

$$o^j_i = \frac{1}{\ell} \sum_{b=1}^{\ell} o^{b,j}_i + \mathrm{Lap}\!\left(\frac{2(\max_i - \min_i) \cdot k \cdot d}{\ell \cdot \epsilon}\right), \tag{3}$$

where $o^{b,j}_i$ is the i-th dimension of $o^{b,j}$ and $[\min_i, \max_i]$ is the estimated output range of the i-th dimension. One half of the total privacy budget is used to estimate this output range, and the other half is used for adding Laplace noise.

We have found that the implementation downloaded from [0], which uses Equation (3), performed poorly. Analyzing the data closely, we found that $\min_i$ and $\max_i$ often fall outside the data range, especially for small ε. We slightly modified the code to bound $\min_i$ and $\max_i$ to be within the data domain. This does not affect privacy, and it greatly improved the accuracy. In this paper we use this fixed version. Here a key parameter is the choice of ℓ.
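The noisy averaging step of GkM can be sketched as below. This is a hedged illustration, not the GUPT code: it assumes the per-block centroids have already been computed and put in a consistent order across blocks, and it assumes the half of the budget reserved for noise is split evenly over the k·d output coordinates, which gives a Laplace scale of 2(max_i − min_i)·k·d/(ℓ·ε). The function name and argument layout are hypothetical.

```python
import numpy as np


def aggregate_block_centroids(block_centroids, lo, hi, epsilon, rng):
    """Average per-block centroids and add Laplace noise per coordinate.

    block_centroids: array of shape (l, k, d), one set of k centroids per block.
    lo, hi: per-dimension output range estimates, each of shape (d,).
    epsilon: total privacy budget; half is assumed spent on estimating lo/hi.
    """
    block_centroids = np.asarray(block_centroids, dtype=float)
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    l, k, d = block_centroids.shape
    # One tuple changes one block, moving the average by at most (hi - lo) / l.
    # Assumption: budget epsilon / 2 shared evenly by the k * d coordinates.
    scale = 2.0 * (hi - lo) * k * d / (l * epsilon)  # shape (d,)
    noise = rng.laplace(0.0, 1.0, size=(k, d)) * scale  # broadcast over dims
    return block_centroids.mean(axis=0) + noise
```

The 1/ℓ factor in the scale is what makes a larger number of blocks reduce the added noise, at the cost of each block seeing fewer points.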
Intuitively, a larger ℓ results in each block being very small and unable to preserve the cluster information, while a smaller ℓ results in larger noise being added. (Note the inverse dependency on ℓ in Equation (3).) Analysis in [1] suggests to set l = N 0.. Our experimental results, however, show that the performance is quite poor. We consider a variant that chooses l = N, i.e., having each block containing points, which performs much better than the setting suggested in [1].

4.1.3 PGkM

PrivGene [0] is a general-purpose differentially private model fitting framework based on genetic algorithms. Given a dataset D and a fitting-score function f(D, θ) that measures how well the parameter θ fits the dataset D, the PrivGene algorithm initializes a candidate set of possible parameters θ and iteratively refines them by mimicking the process of natural evolution. Specifically, in each iteration, PrivGene uses the exponential mechanism [] to privately select from the candidate set the m′ parameters that have the best fitting scores, and generates a new candidate set from the m′ selected parameters by crossover and mutation. Crossover regards each parameter as an ℓ-dimensional vector. Given two parameter vectors, it randomly selects a number ℓ′ such that 0 < ℓ′ < ℓ and splits each vector into the first ℓ′ dimensions (the upper half) and the remaining ℓ − ℓ′ dimensions (the lower half). Then, it swaps the lower halves of the two vectors to generate two child vectors. These vectors are then mutated by adding random noise to one randomly chosen dimension.

In [0], PrivGene is applied to logistic regression, SVM, and k-means clustering. In the case of k-means clustering, the NICV formula in Equation (1), more precisely its non-normalized version, is used as the fitting function f, and the set of k cluster centroids is defined as the parameter θ. Each parameter is a vector of ℓ = kd dimensions. Initially, the candidate set is populated with 00 sets of cluster centroids randomly sampled from the data space, each set containing exactly k centroids.
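The crossover and mutation operators described above can be sketched on flat parameter vectors (for k-means, a set of k centroids flattened to k·d numbers). This is only an illustration of the two operators, not the PrivGene code: the private selection step via the exponential mechanism is omitted, and the uniform mutation noise with a caller-supplied `noise_scale` is an assumption, since the text does not specify the noise distribution.

```python
import numpy as np


def crossover(parent_a, parent_b, rng):
    """Swap the lower halves of two vectors at a random cut 0 < cut < len."""
    parent_a = np.asarray(parent_a, dtype=float)
    parent_b = np.asarray(parent_b, dtype=float)
    length = len(parent_a)  # must be at least 2 so that a valid cut exists
    cut = int(rng.integers(1, length))  # uniform over 1 .. length - 1
    child_a = np.concatenate([parent_a[:cut], parent_b[cut:]])
    child_b = np.concatenate([parent_b[:cut], parent_a[cut:]])
    return child_a, child_b


def mutate(vector, noise_scale, rng):
    """Add noise to one randomly chosen dimension (uniform noise is assumed)."""
    mutated = np.asarray(vector, dtype=float).copy()
    index = int(rng.integers(0, len(mutated)))
    mutated[index] += rng.uniform(-noise_scale, noise_scale)
    return mutated
```

A candidate for k-means could be built with `centroids.reshape(-1)` and turned back into centroids with `vector.reshape(k, d)`.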
Then, the algorithm runs iteratively for max{,(xNε)/m } rounds, where x and m are empirically set to 1. and, respectively, and N is the dataset size.

We call the approach of applying PrivGene to k-means clustering PGkM. It is similar to DPLloyd in that it tries to iteratively improve the centroids. However, rather than maintaining and improving a single set of centroids, PGkM maintains a pool of candidates, uses selection to improve their quality, and uses crossover and mutation to broaden the pool. As with DPLloyd, a key parameter is the number of iterations. With too few iterations, the algorithm may not converge. Too many iterations mean too little privacy budget for each iteration, and the exponential mechanism may not be able to select good candidates.

4.2 Non-interactive Approaches

Interactive approaches such as DPLloyd and GkM suffer from two limitations. First, the purpose of conducting k-means clustering is often to visualize how the data points are partitioned into clusters. The interactive approaches, however, output only the centroids. In the case of DPLloyd, one could also obtain the number of data points in each cluster; however, it cannot provide more detailed information about the shapes that the data points in the clusters take. The value of interactive private k-means clustering is thus limited. Second, as the privacy budget is consumed by the interactive method, one cannot perform any other analysis on the dataset; doing so would violate differential privacy.

Non-interactive approaches, which first generate a synopsis of a dataset using a differentially private algorithm and then apply a k-means clustering algorithm to the synopsis, avoid these two limitations. In this paper, we consider the following synopsis method. Given a d-dimensional dataset, one partitions the domain into M equal-width grid cells, and then releases a noisy count for each cell, obtained by adding Laplacian noise to the cell count. The released synopsis is a set of cells, each of which has a rectangular bounding box and a (noisy) count of how many data points are in the bounding box.

The synopsis tells only how many points are in a cell, but not the exact locations of these points. For the purpose of clustering, we treat all points as if they were at the center of the bounding box. In addition, the noisy counts might be negative, non-integer, or both. A straightforward solution is to round the noisy count of a cell to the nearest non-negative integer and replicate the cell center as many times as the rounded count. This approach, however, may introduce a significant systematic bias in the clustering result when many cells in the UG synopsis are empty or close to empty and these cells are not distributed uniformly. In this case, simply turning negative counts to zero can produce a large number of points in those empty areas, which can pull the centroid away from its true position. We take the approach of keeping the noisy count unchanged and adapting the centroid update procedure in k-means to use the cell as a whole. Specifically, given a cell with center c and noisy count ñ, its contribution to the centroid is c · ñ. With this approach, within one cluster, cells that have negative noisy counts can cancel out the effect of other cells with positive noise. Therefore, we obtain better clustering performance.

For this method, the key parameter is M, the number of cells. When M is large, the average count per cell is low, and the noise will have more impact.
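The grid synopsis and the count-weighted centroid update described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes the domain is [−r, r]^d with the same number of cells along every dimension (so M = cells_per_dim^d), and the guard that leaves a centroid unchanged when the total noisy count of its cluster is not positive is an added assumption. Function names are hypothetical.

```python
import numpy as np


def grid_synopsis(points, cells_per_dim, r, epsilon, rng):
    """Noisy cell counts on a uniform grid over [-r, r]^d.

    Returns (cell_centers, noisy_counts) with shapes (M, d) and (M,).
    Each tuple falls in exactly one cell, so Lap(1 / epsilon) per cell suffices.
    """
    points = np.asarray(points, dtype=float)
    d = points.shape[1]
    edges = np.linspace(-r, r, cells_per_dim + 1)
    counts, _ = np.histogramdd(points, bins=[edges] * d)
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    mids = (edges[:-1] + edges[1:]) / 2.0
    mesh = np.meshgrid(*([mids] * d), indexing="ij")
    centers = np.stack([m.ravel() for m in mesh], axis=1)
    return centers, noisy.ravel()


def weighted_lloyd_step(centers, noisy_counts, centroids):
    """One Lloyd round on the synopsis; a cell contributes center * noisy count."""
    centroids = np.asarray(centroids, dtype=float)
    diff = centers[:, None, :] - centroids[None, :, :]
    labels = (diff ** 2).sum(axis=2).argmin(axis=1)
    updated = centroids.copy()
    for j in range(len(centroids)):
        mask = labels == j
        total = noisy_counts[mask].sum()  # negative counts are kept as they are
        if total > 0:  # assumption: otherwise leave the centroid unchanged
            weighted = centers[mask] * noisy_counts[mask, None]
            updated[j] = weighted.sum(axis=0) / total
    return updated
```

Because the update uses only the published synopsis, it can be repeated with as many initial centroid sets as desired without spending further privacy budget.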
When M is small, each cell covers a large area, and treating all points as if they were at the center may be inaccurate when the points are not uniformly distributed. We now describe two methods of choosing M.

4.2.1 EUGkM

Qardaji et al. [] studied the effectiveness of producing differentially private synopses of 2-dimensional datasets for answering rectangular range counting queries (i.e., how many data points there are in a rectangular range) with high accuracy, and suggested choosing M = Nε/10. We now analyze the choice of M for the higher-dimensional case.

Let act be the true answer to a range counting query and est its estimate from the synopsis. We use mean squared error (MSE) to measure the accuracy of est with respect to act. That is,

$$\mathrm{MSE}(est) = E\!\left[(est - act)^2\right] = \mathrm{Var}(est) + (\mathrm{Bias}(est))^2,$$

where Var(est) is the variance of est and Bias(est) is its bias.

There are two error sources when computing est. First, Laplace noise is added to cell counts to satisfy differential privacy. This results in the variance of est. Since counting the size of a cell has sensitivity 1, Laplace noise $\mathrm{Lap}(1/\epsilon)$ is added. Thus, the noisy count has variance $2/\epsilon^2$. Suppose that the given counting query covers an α portion of the total M cells in the data space. Then $\mathrm{Var}(est) = 2\alpha M/\epsilon^2$.

Second, the given counting query may not fully contain the cells that fall on the border of the query rectangle. To estimate the number of points in the intersection between the query rectangle and the border cells, one assumes that the data are uniformly distributed. This results in the bias of est, which depends on the number of tuples in the border cells. The border of the given query consists of 2d hyperrectangles, each being (d−1)-dimensional. The number of cells falling on a hyperrectangle is on the order of $M^{1-1/d}$. On average, the number of tuples in these cells is on the order of $M^{1-1/d} \cdot \frac{N}{M} = \frac{N}{M^{1/d}}$. Therefore, we estimate the bias of est with respect to one hyperrectangle to be $\beta \frac{N}{M^{1/d}}$, where β ≥ 0 is a parameter. We thus estimate $(\mathrm{Bias}(est))^2$ to be $2d \left(\beta \frac{N}{M^{1/d}}\right)^2$.

Summing the variance and the squared bias, it follows that

$$\mathrm{MSE}(est) = \frac{2\alpha M}{\epsilon^2} + 2d\beta^2 \frac{N^2}{M^{2/d}}.$$

To minimize the MSE, we set the derivative of the above expression with respect to M to 0. This gives

$$M = \left(\frac{N\epsilon}{\theta}\right)^{\frac{2d}{2+d}}, \tag{4}$$

where $\theta = \sqrt{\frac{\alpha}{2\beta^2}}$. We name this extended approach EUG (extended uniform gridding approach). We use EUGkM to denote the EUG-based k-means clustering scheme.

5. PERFORMANCE AND ANALYSIS

In this section, we compare and analyze the performance of the five methods introduced in the last section.

5.1 Evaluation Methodology

We experimented with six external datasets and a group of synthetically generated datasets. The first dataset is a 2D synthetic dataset S1 [1], which is a benchmark for studying the performance of clustering schemes. S1 contains,000 tuples and 1 Gaussian clusters. The Gowalla dataset contains the user check-in locations from the Gowalla location-based social network, whose users share their check-in times and locations (longitude and latitude). We take all the unique locations and obtain a 2D dataset of,01 tuples. We set k = for this dataset. The third dataset is a 1- percentage sample of the road dataset drawn from the 00 TIGER (Topologically Integrated Geographic Encoding and Referencing) dataset []. It contains the GPS coordinates of road intersections in the states of Washington and New Mexico. The fourth is Image [1], a 3D dataset with,11 RGB vectors. We set k = for it. We also use the well-known Adult dataset [1]. We use its six numerical attributes, and set k =. The last dataset is Lifesci. It contains, records, each of which consists of the top principal components for a chemistry or biology experiment. As in previous approaches [1, 0], we set k =. Table 1 summarizes the datasets. For all the datasets, we normalize the domain of each attribute to [-, ].

When generating the synthetic datasets, we fix the dataset size to,000, and vary k and d from to. For each dataset, k well-separated Gaussian clusters of equal size are generated, and 0 sets of initial centroids are generated in the same way as in Section 4.1.1.

Implementations of DPLloyd and GkM were downloaded from [] and [0], respectively.
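As a side note on the grid-size formula M = (Nε/θ)^{2d/(2+d)} for EUGkM, the helper below evaluates it numerically. It is a small sketch with a hypothetical function name; θ has to be supplied by the caller, and rounding M to a whole number of cells per dimension (at least one) is an assumption made here so that the result can be used to build an equal-width grid.

```python
def eug_grid_size(n, epsilon, d, theta):
    """Evaluate M = (n * epsilon / theta) ** (2d / (2 + d)).

    Returns (m_total, cells_per_dim), where cells_per_dim is the per-dimension
    cell count obtained by rounding m_total ** (1 / d), never below 1.
    """
    m_total = (n * epsilon / theta) ** (2.0 * d / (2.0 + d))
    cells_per_dim = max(1, round(m_total ** (1.0 / d)))
    return m_total, cells_per_dim
```

For d = 2 the exponent equals 1, so the formula reduces to M = Nε/θ, the 2-dimensional uniform-grid guideline. For larger d the exponent approaches 2 while the per-dimension cell count M^{1/d} shrinks, which is consistent with the observation that EUGkM is sensitive to the number of dimensions.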
The source code of PGkM [0] was shared by the authors. We implemented EUGkM.

Configuration. Each algorithm outputs k centroids $o = \{o^1, o^2, \ldots, o^k\}$. To evaluate the quality of such an output o, we compute the average squared distance between any data point in D and the nearest centroid in o, and call this the NICV. We note that since both DPLloyd and EUGkM use Lloyd-style iteration, they are affected by the choice of initial centroids. In addition, all algorithms have random noise added somewhere to satisfy differential privacy. To conduct a fair comparison, we need to carefully average out such randomness effects.

GkM and PGkM do not take a set of initial centroids as input. GkM divides the input dataset into multiple blocks, and for each block invokes the standard k-means implementation from the Scipy package [] with a different set of initial centroids to get the result, and finally aggregates the outputs for all the blocks. We run GkM and PGkM 0 times and report the average result. For DPLloyd, we generate 0 sets of initial centroids, run DPLloyd 0 times on each set of initial centroids, and report the average of the 000 NICV values as the final evaluation of DPLloyd.

The non-interactive approach (EUGkM) has the advantage that once a synopsis is published, one can run k-means clustering with as many sets of initial centroids as one wants and choose the result that has the best performance relative to the synopsis. In our experiments, given a synopsis, we use the same 0 sets of initial centroids as those generated for the DPLloyd method. For each set, we run clustering and output a set of k centroids. Among all the 0 sets of output centroids, we select the one that has the lowest NICV relative to the synopsis rather than to the original dataset. This ensures that selecting the set of output centroids satisfies differential privacy. We then compute the NICV of this selected set relative to the original dataset, and take it as the resulting NICV with respect to the synopsis. To deal with the randomness introduced by the process of generating the synopsis, we generate different synopses and take the average of the resulting NICV. As the baseline, we run the standard k-means algorithm [] over the same 0 sets of initial centroids and take the minimum NICV among all the 0 runs.

Table 1: Descriptions of the Datasets.
Dataset # of tuples l GkM l GkM-K S1, Gowalla,01,1 TIGER 1,1,1 Image,11,0 Adult-num,1, Lifesci,,0 Synthetic,000 [, ] [, ] 0 000/()

Experimental Results. Figure 1 reports the results for the external datasets. For these, we vary ε from 0.0 to.0 and plot the NICV curve for the methods mentioned in Section 4.
This enables us to see how these algorithms perform under different ε. Figure reports the results for the synthetic datasets. For these, we fix ε = and report the difference of NICV between each approach and the baseline. This enables us to see the scalability of these algorithms when k and d increase.

Among the interactive approaches, DPLloyd has the best performance in most cases. Its performance is worse than that of PGM only on the small dataset S1 when the privacy budget ε is smaller than. Comparing DPLloyd and EUGM, we observe that in the four low dimensional datasets (S1, Gowalla, TIGER and Image), EUGM clearly outperforms DPLloyd at small ε values, and their gap becomes smaller as ε increases. However, in the two high dimensional datasets (Adult-num and Lifesci), DPLloyd outperforms EUGM for almost all given privacy budgets. Similar results can also be found in Figure. Figure also exhibits the effects of the number of clusters and the number of dimensions. EUGM's performance is more sensitive to the increase of dimension, while DPLloyd gets worse quickly as the number of clusters increases. Below we analyze these algorithms to understand why they perform in this way. In addition, Figure shows the difference of EUGM's performance for different choices of θ. Setting θ = for EUGM works well in most cases.

The Analysis of the GM Approach

From Figures 1 and, it is clear that GM is always much worse than the others. There are two sources of error for GM. One is that GM aggregates centroids computed from subsets of the data, and this aggregate may be inaccurate even without adding noise. The other is that the noise added according to Equation () may be too large. To tease out the role played by these two error sources, Figure shows the effect of varying the block size from around N to N. It shows the error from GM, the error from using the aggregation without noise (SAG), and the error from adding noise computed by Equation () to the best known centroids (Noise).
From the figure, it is clear that setting l = N^{0.}, which corresponds to a block size of N^{0.}, is far from optimal, as the error of GM is dominated by that from the noise, and is much higher than the error due to sample and aggregation. Indeed, we observe that as the block size decreases the error of GM keeps decreasing, until the block size gets close to k. It seems that even though many individual blocks result in poor centroids, aggregating these relatively poor centroids can result in highly accurate centroids. This effect is most pronounced in the TIGER dataset, which consists of two large clusters. The two centroids computed from each small block can be approximately viewed as choosing one random point from each cluster. When averaging these centroids, one gets very close to the true centroids.

This observation motivated the introduction of the GM-K algorithm, which fixes each block size to be k. Recall that we are to select k centroids from each block. As can be seen from Figures 1 and, GM-K becomes competitive with PGM, and sometimes significantly outperforms PGM (e.g., on TIGER and Lifesci), although it still underperforms DPLloyd.

The Analysis of the PGM Approach

PGM is a stochastic k-means method based on genetic algorithms. A stochastic method converges to the global optimum []. In contrast, DPLloyd is a gradient descent method derived from the standard Lloyd's algorithm [], which may reach only a local optimum. However, PGM is still inferior to DPLloyd in Figure 1. There are two possible reasons.

First, a stochastic approach typically takes a larger number of iterations to converge []. Figure compares Lloyd's algorithm with Gene (i.e., the non-private version of PGM without considering differential privacy). For Lloyd, we reuse the initial centroids generated in Section.1. Given a dataset, we run Lloyd on the 0 sets of initial centroids generated for the dataset, and report the average NICV. Generally, Gene overtakes Lloyd as the number of iterations increases and finally converges to the global optimum.
However, Lloyd improves its performance much faster than Gene in the first few iterations, and converges to the global optimum (or a local optimum) more quickly. For example, in the Image dataset, Lloyd reaches the best baseline after three iterations, while Gene needs more than iterations to achieve the same.

The second reason that PGM is inferior to DPLloyd is the low privacy budget allocated to selecting a parameter (i.e., a set of k cluster centroids) from the candidate set. In each iteration PGM selects parameters, and the total number of iterations is at least. Thus, the privacy budget allocated to selecting a single parameter is at most ε/0. Therefore, PGM has reasonable performance only for large ε values.

Figure 1: The comparison of DPLloyd, EUGM, PGM and GM. Panels: (a) S1, (b) Image, (c) Gowalla, (d) Adult-num, (e) TIGER, (f) Lifesci. x-axis: privacy budget ε in log-scale. y-axis: NICV in log-scale.

THE HYBRID APPROACH

Experimental results in Section establish that DPLloyd is the best performing interactive method; however, it still underperforms EUGM. Recall that EUGM publishes a private synopsis of the dataset, and thus enables other analyses to be performed on the dataset, beyond k-means. This means that currently the non-interactive method has a clear advantage over interactive methods, at least for k-means clustering.

An intriguing question is whether EUGM is the best we can do for k-means clustering. In particular, can we further improve DPLloyd? Recall that there are two key issues that greatly affect the accuracy of DPLloyd: the number of iterations and the choice of initial centroids. In fact, these two are closely related. If the initially chosen centroids are very good and close to the true centroids, one needs perhaps only one more iteration to improve them, and this reduction in the number of iterations would mean that little noise is added.

Now if only we had a method to choose really good centroids in a differentially private way, then we could use part (e.g., half) of the privacy budget to get those initial centroids, and the remaining privacy budget to run one iteration of DPLloyd to further improve them. In fact, we do have such a method: EUGM does it. This leads us to propose a hybrid method that combines non-interactive EUGM with interactive DPLloyd. We first use half the privacy budget to run EUGM, and then use the centroids output by EUGM as the initial centroids for one round of DPLloyd.

Such a method, however, may not actually outperform EUGM, especially when the privacy budget ε is small, since then one round of DPLloyd may actually worsen the centroids. Therefore, when ε is small, we should stick to the EUGM method, and only when ε is large enough should we adopt the EUGM+DPLloyd approach.
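The refinement step of the hybrid method can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the data are normalized to [-r, r] in each of the d dimensions and that every per-cluster sum and count receives Laplace noise of scale (dr+1)t/ε, the noise level used in the DPLloyd error analysis. The function names, the guard against tiny noisy counts, and the clipping of the output to the data domain are assumptions added for the sketch.

```python
import numpy as np

def noisy_lloyd_step(data, centroids, eps, r=1.0, t=1, rng=None):
    """One Lloyd update with Laplace noise on every per-cluster sum and count.

    Assumption: noise scale (d*r + 1) * t / eps, where t is the total number
    of iterations that share the budget eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = data.shape
    k = centroids.shape[0]
    scale = (d * r + 1.0) * t / eps

    # Assign each point to its nearest current centroid.
    dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)

    new_centroids = centroids.copy()
    for j in range(k):
        members = data[labels == j]
        noisy_count = len(members) + rng.laplace(0.0, scale)
        noisy_sum = members.sum(axis=0) + rng.laplace(0.0, scale, size=d)
        # Hypothetical post-processing: keep the old centroid when the noisy
        # count is too small, and clip the result back into the data domain.
        if noisy_count >= 1.0:
            new_centroids[j] = np.clip(noisy_sum / noisy_count, -r, r)
    return new_centroids

def hybrid_refine(data, eugm_centroids, eps, r=1.0, rng=None):
    """Spend the second half of the budget on a single noisy Lloyd round."""
    return noisy_lloyd_step(data, eugm_centroids, eps / 2.0, r=r, t=1, rng=rng)
```

Here `eugm_centroids` stands for whatever centroids were obtained from the synopsis with the first half of the budget; the synopsis step itself is not shown.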
In order to determine what ε is large enough, we analyze how the errors depend on the various parameters in DPLloyd and in EUGM.

Figure: The heatmap by varying k and d. Panels: (a) DPLloyd, (b) PGM, (c) GM, (d) GM-K, (e) EUGM, (f)-(k) EUGM with different values of θ.

Error Study of DPLloyd

DPLloyd adds noise to each iteration of updating centroids. To study the error behavior of DPLloyd due to the injected Laplace noise, we focus on analyzing the mean squared error (MSE) between noisy centroids and true centroids in one iteration.

Consider one centroid and its update in one iteration. The true centroid's i-th dimension should be $o_i = \frac{S_i}{C}$, where C is the number of data points in the cluster and $S_i$ is the sum of the i-th dimension coordinates of the data points in the cluster. Consider the noisy centroid $\hat{o}$; its i-th dimension is $\hat{o}_i = \frac{S_i + \Delta S_i}{C + \Delta C}$, where $\Delta C$ is the noise added to the count and $\Delta S_i$ is the noise added to $S_i$. The MSE is thus:

$$\mathrm{MSE}(\hat{o}) = E\left[\sum_{i=1}^{d}\left(\frac{S_i + \Delta S_i}{C + \Delta C} - \frac{S_i}{C}\right)^2\right]$$

Derivation based on the above formula gives the following proposition.

Figure: The analysis of the GM Approach. Curves: SAG, GM, Noise. Panels: (a) S1, (b) Gowalla, (c) TIGER, (d) Image, (e) Adult-num, (f) Lifesci. x-axis: block size exponent in log-scale, y-axis: NICV in log-scale.

Figure: The comparison of the convergence rate of the genetic algorithm based k-means and Lloyd's algorithm. Curves: Gene, Lloyd. Panels: (a) S1, (b) Gowalla, (c) TIGER, (d) Image, (e) Adult-num, (f) Lifesci. x-axis: number of iterations in log-scale, y-axis: NICV in log-scale.

PROPOSITION 1. In one round of DPLloyd, the MSE is $\Theta\left(\frac{(kt)^2 d^3}{(N\epsilon)^2}\right)$.

PROOF. Let us first consider the MSE on the i-th dimension. Approximating $C + \Delta C$ by $C$ in the denominator,

$$\begin{aligned}
\mathrm{MSE}(\hat{o}_i) &= E\left[\left(\frac{S_i + \Delta S_i}{C + \Delta C} - \frac{S_i}{C}\right)^2\right] \approx E\left[\left(\frac{C\,\Delta S_i - S_i\,\Delta C}{C^2}\right)^2\right] \\
&= \frac{E[(\Delta S_i)^2]}{C^2} + \frac{S_i^2\,E[(\Delta C)^2]}{C^4} - \frac{2\,C\,S_i\,E[\Delta S_i\,\Delta C]}{C^4} \\
&= \frac{\mathrm{Var}(\Delta S_i)}{C^2} + \frac{S_i^2\,\mathrm{Var}(\Delta C)}{C^4}
\end{aligned}$$

The last step holds because $\Delta S_i$ and $\Delta C$ are independent zero-mean Laplacian noises and the following formulas hold:

$$E[\Delta S_i\,\Delta C] = 0$$
$$E[(\Delta S_i)^2] = E[(\Delta S_i)^2] - (E[\Delta S_i])^2 = \mathrm{Var}(\Delta S_i)$$
$$E[(\Delta C)^2] = E[(\Delta C)^2] - (E[\Delta C])^2 = \mathrm{Var}(\Delta C),$$

where $\mathrm{Var}(\Delta S_i)$ and $\mathrm{Var}(\Delta C)$ are the variances of $\Delta S_i$ and $\Delta C$, respectively.

Suppose that on average $\frac{S_i}{rC} = \rho$, where [-r, r] is the range of the i-th dimension. That is, ρ is the normalized coordinate of the i-th dimension of the cluster's centroid. Furthermore, suppose that each cluster is about the same size, i.e., $C \approx \frac{N}{k}$. Then $\mathrm{MSE}(\hat{o}_i)$ can be approximated as follows:

$$\mathrm{MSE}(\hat{o}_i) \approx \frac{k^2}{N^2}\left(\mathrm{Var}(\Delta S_i) + (\rho r)^2\,\mathrm{Var}(\Delta C)\right)$$

DPLloyd adds to each sum/count function Laplace noise $\mathrm{Lap}\left(\frac{(dr+1)t}{\epsilon}\right)$. Therefore, both $\mathrm{Var}(\Delta S_i)$ and $\mathrm{Var}(\Delta C)$ are equal to $2\left(\frac{(dr+1)t}{\epsilon}\right)^2$. Substituting into the approximation above, we obtain

$$\mathrm{MSE}(\hat{o}_i) \approx \frac{k^2}{N^2}\left(\mathrm{Var}(\Delta S_i) + (\rho r)^2\,\mathrm{Var}(\Delta C)\right) = 2k^2\left(1 + (\rho r)^2\right)\left(\frac{t(dr+1)}{N\epsilon}\right)^2.$$

As the noise added to each dimension is independent, the MSE is

$$\mathrm{MSE}(\hat{o}) = \sum_{i=1}^{d}\mathrm{MSE}(\hat{o}_i) \approx 2dk^2\left(1 + (\rho r)^2\right)\left(\frac{t(dr+1)}{N\epsilon}\right)^2.$$

When r is a small constant, this becomes $\Theta\left(\frac{(kt)^2 d^3}{(N\epsilon)^2}\right)$.

Proposition 1 shows that the distortion to the centroid is proportional to $t^2 k^2 d^3$, while inversely proportional to $(N\epsilon)^2$. At first glance, this analysis seems to conflict with the experimental result in Figure (a), where DPLloyd is much less scalable to k than to d. The reason is that the performance of DPLloyd is also affected by the fact that its rounds are not enough for it to converge. When k increases, converging takes more time, and it is also more likely that the choice of initial centroids leads to local optima that are far from the global optimum.

Error Study of EUGM

The non-interactive approach partitions a dataset into a grid of M uniform cells. Then, it releases private synopses for the cells, and runs k-means clustering on the synopses to return the cluster centroids. Similar to the error analysis for DPLloyd, we analyze the MSE. Let o be the true centroid of a cluster, and $\hat{o}$ be its estimator computed by a non-interactive approach. The MSE between $\hat{o}$ and o is composed of two error sources. First, the count in each cell is inaccurate after adding Laplace noise. This results in the variance (i.e., $\mathrm{Var}(\hat{o})$) of $\hat{o}$ from its expectation $E[\hat{o}]$. Second, we no longer have the precise positions of data points, and only assume that they occur at the center of a cell. Thus, the expectation of $\hat{o}$ is not equal to o, resulting in a bias (i.e., $\mathrm{Bias}(\hat{o})$). The MSE is the combination of these two errors:

$$\mathrm{MSE}(\hat{o}) = \mathrm{Var}(\hat{o}) + \left(\mathrm{Bias}(\hat{o})\right)^2$$

Analyzing the variance.
We assume that each cluster has a volume that is $\frac{1}{k}$ of the total volume of the data space, and has the shape of a cube. In the d-dimensional case, the width of the cube is $w = \frac{2r}{k^{1/d}}$. Suppose that the geometric center of the cube is $\tau_i$ (note that this is not the cluster centroid). Let T be the set of cells included in the cluster. For each cell $t \in T$, we use $c_t$ to denote the number of tuples in t, $t_i$ to denote the i-th dimension coordinate of the center of cell t, and $\nu_t$ to denote the noise added to the cell size. Let $\hat{o}_i$ be the i-th dimension of the noisy centroid. Then, the variance of $\hat{o}_i$ is

$$\begin{aligned}
\mathrm{Var}(\hat{o}_i) &= \mathrm{Var}(\hat{o}_i - \tau_i) \\
&= \mathrm{Var}\left(\frac{\sum_{t \in T} t_i (c_t + \nu_t)}{\sum_{t \in T} (c_t + \nu_t)} - \tau_i\right) \\
&= \mathrm{Var}\left(\frac{\sum_{t \in T} (t_i - \tau_i)(c_t + \nu_t)}{\sum_{t \in T} (c_t + \nu_t)}\right) \\
&\approx \frac{1}{C^2}\sum_{t \in T}\left((t_i - \tau_i)^2\,\mathrm{Var}(c_t + \nu_t)\right).
\end{aligned}$$

In the above, the first step follows because $\tau_i$, as the cube's geometric center, is a constant. The last step is derived by assuming $\sum_{t \in T}(c_t + \nu_t) \approx C$, that is, the noisy cluster size is approximately equal to the original cluster size C.

We can see that within the cube, different cells' contributions to the variance are not the same. Basically, the closer a cell is to the cube center, the smaller its contribution. The contribution is proportional to the squared distance to the cube center. We thus approximate the variance as follows:

$$\mathrm{Var}(\hat{o}_i) \approx \frac{1}{C^2}\int_{-w/2}^{w/2} x^2 \cdot \frac{M}{(2r)^d} \cdot w^{d-1} \cdot \frac{2}{\epsilon^2}\,dx = \frac{2Mr^2}{3\,C^2\,\epsilon^2\,k^{\frac{2+d}{d}}}$$

In the above integral, x in the first term is the distance from a cell center to the cube center (i.e., $t_i - \tau_i$). The second term $\frac{M}{(2r)^d}$ is the number of cells per unit volume, and $w^{d-1}$ is the volume of the (d-1)-dimensional plane that has a distance of x to the cube center. The last term $\frac{2}{\epsilon^2}$ is the variance of the cell size (i.e., $\mathrm{Var}(c_t + \nu_t)$).

Suppose that clusters are of equal size, that is, $C = \frac{N}{k}$. Then, the variance of the noisy centroid, obtained by summing over all the dimensions, is

$$\mathrm{Var}(\hat{o}) \approx \frac{2\,d\,k^{\frac{d-2}{d}}\,M r^2}{3\,N^2 \epsilon^2}.$$

The analysis shows that the variance of EUGM is proportional to $\frac{M}{(N\epsilon)^2}$. EUGM sets M to $\left(\frac{N\epsilon}{\theta}\right)^{\frac{2d}{2+d}}$. Plugging it into the variance expression above, we get that the variance of EUGM is inversely proportional to $(N\epsilon)^{\frac{4}{2+d}}$.

Analyzing the bias.
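Let $x_i$ be the i-th dimension coordinate of a tuple x. Then, the bias of $\hat{o}_i$ is

$$\begin{aligned}
\mathrm{Bias}(\hat{o}_i) &= E[\hat{o}_i] - o_i \\
&= E\left[\frac{\sum_{t \in T} t_i (c_t + \nu_t)}{\sum_{t \in T} (c_t + \nu_t)}\right] - \frac{\sum_{t \in T}\sum_{x \in t} x_i}{\sum_{t \in T} c_t} \\
&\approx \frac{1}{C}\sum_{t \in T}\sum_{x \in t}(t_i - x_i),
\end{aligned}$$

where the last step is developed by approximating $\sum_{t \in T}(c_t + \nu_t)$ by the cluster size C.

The bias developed in the above formula depends on the data distribution. Its precise estimation requires access to the real data. We thus only estimate its upper bound. Let $q_i = t_i - x_i$. The non-interactive approach partitions each dimension into $M^{1/d}$ intervals of equal length. Hence, $q_i$ falls in the range $\left[-\frac{r}{M^{1/d}}, \frac{r}{M^{1/d}}\right]$, and the upper bound of $\mathrm{Bias}(\hat{o}_i)$ is $\frac{r}{M^{1/d}}$. Summing over all the dimensions, we obtain the upper bound of the squared bias of the noisy centroid:

$$\left(\mathrm{Bias}(\hat{o})\right)^2 \le \frac{d\,r^2}{M^{2/d}}.$$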
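To make the two EUGM error terms concrete, the following Python sketch evaluates the grid size M = (Nε/θ)^(2d/(2+d)), the squared-bias bound d r²/M^(2/d), and the variance approximation obtained from the cube-shaped-cluster integral. It is only an illustration under the assumptions of the analysis (equal-size cube-shaped clusters, Laplace noise of variance 2/ε² per cell); the function names are hypothetical, and the constant factors follow from evaluating that integral as stated rather than from any reference implementation.

```python
def eug_grid_cells(n, eps, d, theta):
    """Number of grid cells M = (N * eps / theta) ** (2d / (2 + d))."""
    return (n * eps / theta) ** (2.0 * d / (2.0 + d))

def squared_bias_bound(m, d, r=1.0):
    """Upper bound d * r^2 / M^(2/d) on the squared bias of a noisy centroid."""
    return d * r ** 2 / m ** (2.0 / d)

def approx_variance(m, n, eps, d, k, r=1.0):
    """Variance approximation 2 d k^((d-2)/d) M r^2 / (3 N^2 eps^2).

    Assumes k equal-size, cube-shaped clusters and per-cell Laplace noise
    with variance 2 / eps^2.
    """
    return 2.0 * d * k ** ((d - 2.0) / d) * m * r ** 2 / (3.0 * (n * eps) ** 2)

def eugm_mse_bound(n, eps, d, k, theta, r=1.0):
    """Variance plus squared-bias bound at the EUG grid size."""
    m = eug_grid_cells(n, eps, d, theta)
    return approx_variance(m, n, eps, d, k, r) + squared_bias_bound(m, d, r)
```

With M chosen this way, both terms scale as (Nε)^(-4/(2+d)): doubling Nε shrinks the bound by a factor of 2^(4/(2+d)), which is close to 1 when d is large.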

The estimation shows that the upper bound of the squared bias decreases as a function of M. This is consistent with expectation. As M increases, the data space is partitioned into finer-grained cells. Therefore, the distance between a tuple in a cell and the cell center decreases on average.

Comparing DPLloyd and EUGM. We now analyze the performance of DPLloyd and EUGM in Figure 1. The MSE of DPLloyd derived in Proposition 1 is inversely proportional to $(N\epsilon)^2$. The MSE of EUGM consists of variance and squared bias. Plugging $M = \left(\frac{N\epsilon}{\theta}\right)^{\frac{2d}{2+d}}$ into the variance approximation and the squared-bias bound, it follows that the MSE of EUGM is inversely proportional to $(N\epsilon)^{\frac{4}{2+d}}$. This explains why the NICV of DPLloyd, which is inversely proportional to $(N\epsilon)^2$, drops much faster than that of EUGM as ε grows. It also explains why DPLloyd has better performance on big datasets (e.g., the TIGER dataset).

The MSE of EUGM is inversely proportional to $(N\epsilon)^{\frac{4}{2+d}}$. Thus, it increases exponentially as a function of d. In contrast, the MSE of DPLloyd has only cubic growth with respect to d. Therefore, in Figure 1, as the dimensionality of the dataset increases, DPLloyd outperforms EUGM. This also explains why, in Figure, DPLloyd is more scalable to d than EUGM.

The Hybrid Approach

Our hybrid approach combines EUGM and DPLloyd. Given a dataset and privacy budget ε, the hybrid approach first checks whether it overtakes the DPLloyd method and also the EUGM method. If this is not the case, the hybrid approach simply falls back to EUGM. Otherwise, the hybrid approach allocates half of the privacy budget to EUGM to output a synopsis and find k intermediary centroids that work well for the synopsis. Then, it runs DPLloyd for one iteration using the remaining half of the privacy budget to refine these k centroids.

We use the MSE to heuristically determine the conditions under which the hybrid approach overtakes the DPLloyd method and also the EUGM method.
Basically, we require that the MSE of the hybrid approach be smaller than those of the other two approaches, since a smaller MSE implies a smaller error in the cluster centroids. From the analysis of DPLloyd, it follows that the MSE of DPLloyd with the full privacy budget is

$$2dk^2\left(1 + (\rho r)^2\right)\left(\frac{t(dr+1)}{N\epsilon}\right)^2. \quad (11)$$

A precise estimation of the MSE of the EUGM method requires access to the dataset, since the bias depends on the real data distribution. However, we have the approximate variance, obtained by setting $M = \left(\frac{N\epsilon}{\theta}\right)^{\frac{2d}{2+d}}$:

$$\frac{2\,d\,k^{\frac{d-2}{d}}\,r^2}{3\,\theta^{\frac{2d}{2+d}}\,(N\epsilon)^{\frac{4}{2+d}}}.$$

One-iteration DPLloyd with half of the privacy budget outputs the final k cluster centroids, if it is applied in the hybrid approach. Therefore, we approximate the MSE of the hybrid approach by that of the one-iteration DPLloyd,

$$2dk^2\left(1 + (\rho r)^2\right)\left(\frac{2(dr+1)}{N\epsilon}\right)^2,$$

which is developed by setting t = 1 and the privacy budget to 0.5ε in the MSE of DPLloyd.

Comparing Formula 11 with the hybrid MSE, it follows that the MSE of the hybrid approach is lower than or equal to that of DPLloyd if

$$t \ge 2.$$

Variance is a lower bound of the MSE. Thus, if the MSE of the hybrid approach is equal to or smaller than the variance of the EUGM method, then it is certain that the hybrid approach has the lower MSE. Setting the hybrid MSE smaller than or equal to the EUGM variance yields

$$\epsilon \ge \left(\frac{X}{Y}\right)^{\frac{2+d}{2d}},$$

where

$$X = 2dk^2\left(1 + (\rho r)^2\right)\left(\frac{2(dr+1)}{N}\right)^2 \quad\text{and}\quad Y = \frac{2\,d\,k^{\frac{d-2}{d}}\,r^2}{3\,\theta^{\frac{2d}{2+d}}\,N^{\frac{4}{2+d}}}.$$

These two inequalities give the conditions for applying the hybrid approach. The condition on t is automatically satisfied since DPLloyd runs for t = iterations.

Experimental results

We now compare the hybrid approach with EUGM and DPLloyd. The configuration for EUGM and DPLloyd is the same as in Section.1. For the hybrid approach, we run EUGM times to output sets of intermediate centroids. Then we run DPLloyd times on each intermediate result. We finally report the average of 0 NICV values.

Figure gives the results on the six external datasets. In the low dimensional datasets (S1, Gowalla, TIGER, and Image), the hybrid approach simply falls back to EUGM for small ε values.
When ε increases, both the hybrid approach and EUGM converge to the baseline, with the former having slightly better performance. For example, in the Gowalla dataset for ε = 0., the average NICV of the hybrid approach is 0.01 and that of EUGM is 0.01.

In the higher dimensional datasets (Adult-num and Lifesci), the hybrid approach outperforms the other two approaches in most cases. It is worse than DPLloyd only for a few small ε values, for which it falls back to EUGM. There are two possible reasons. The first is that the MSE analysis assumes that datasets are well clustered and each cluster has equal size, but the real datasets are skewed. For example, the baseline approach partitions the Adult-num dataset into clusters, in which the biggest cluster contains 1, tuples and the smallest contains,10 tuples. The second is that we use the variance of EUGM as the lower bound of its MSE. Thus, it is possible that the MSE of the hybrid approach (approximated by the MSE of one-iteration DPLloyd with half of the privacy budget) is larger than the variance of EUGM, but actually smaller than its MSE. In such cases, the hybrid approach would give a lower NICV if it did not fall back to EUGM. For example, on the Adult-num dataset for ε = 0.0, the hybrid approach falling back to EUGM has an NICV of 0.0, while its NICV is 0. if it applies EUGM plus one iteration of DPLloyd.

We also evaluate the approaches using the synthetic datasets generated as in Section.1. Figure clearly shows that the hybrid approach is more scalable than EUGM with respect to both k and d. This confirms the effectiveness of the hybrid approach.

Figure presents the runtime of DPLloyd and EUGM on the six external datasets. We follow the same experiment configuration as.
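The fall-back rule of the hybrid approach can be sketched as a small decision function. This is a hedged illustration rather than the authors' code: it compares the approximate MSE of one-iteration DPLloyd at half budget against both the full-budget DPLloyd MSE and the approximate EUGM variance, using the constant factors from the error analysis. The function names are hypothetical, and ρ (the normalized centroid coordinate) and θ must be supplied by the caller because their values are not fixed here.

```python
def dplloyd_mse(n, eps, d, k, t, r, rho):
    """Approximate DPLloyd MSE: 2 d k^2 (1 + (rho r)^2) (t (d r + 1) / (N eps))^2."""
    return 2.0 * d * k ** 2 * (1.0 + (rho * r) ** 2) * (t * (d * r + 1.0) / (n * eps)) ** 2

def eugm_variance(n, eps, d, k, theta, r):
    """Approximate EUGM variance at M = (N eps / theta) ** (2d / (2 + d))."""
    m = (n * eps / theta) ** (2.0 * d / (2.0 + d))
    return 2.0 * d * k ** ((d - 2.0) / d) * m * r ** 2 / (3.0 * (n * eps) ** 2)

def hybrid_eps_threshold(n, d, k, theta, r, rho):
    """Smallest eps with hybrid MSE <= EUGM variance: (X / Y) ** ((2 + d) / (2d))."""
    x = 2.0 * d * k ** 2 * (1.0 + (rho * r) ** 2) * (2.0 * (d * r + 1.0) / n) ** 2
    y = (2.0 * d * k ** ((d - 2.0) / d) * r ** 2
         / (3.0 * theta ** (2.0 * d / (2.0 + d)) * n ** (4.0 / (2.0 + d))))
    return (x / y) ** ((2.0 + d) / (2.0 * d))

def choose_method(n, eps, d, k, t, theta, r, rho):
    """Return 'hybrid' when both conditions hold, otherwise fall back to 'EUGM'."""
    hybrid = dplloyd_mse(n, eps / 2.0, d, k, 1, r, rho)  # one round, half budget
    full = dplloyd_mse(n, eps, d, k, t, r, rho)          # t rounds, full budget
    if hybrid <= full and hybrid <= eugm_variance(n, eps, d, k, theta, r):
        return "hybrid"
    return "EUGM"
```

Comparing the two MSE expressions directly, as `choose_method` does, is equivalent to checking t ≥ 2 together with the closed-form threshold on ε; the direct comparison is simply less error-prone to code.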
