Non-homogeneous Generalization in Privacy Preserving Data Publishing

W. K. Wong, Nikos Mamoulis and David W. Cheung
Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong

ABSTRACT
Most previous research on privacy-preserving data publishing, based on the k-anonymity model, has followed the simplistic approach of homogeneously giving the same generalized value in all quasi-identifiers within a partition. We observe that the anonymization error can be reduced if we follow a non-homogeneous generalization approach for groups of size larger than k. Such an approach would allow tuples within a partition to take different generalized quasi-identifier values. Anonymization following this model is not trivial, as its direct application can easily violate k-anonymity. In addition, non-homogeneous generalization allows for additional types of attack, which should be considered in the process. We provide a methodology for verifying whether a non-homogeneous generalization violates k-anonymity. Then, we propose a technique that generates a non-homogeneous generalization for a partition and show that its result satisfies k-anonymity; however, by straightforwardly applying it, privacy can be compromised if the attacker knows the anonymization algorithm. Based on this, we propose a randomization method that prevents this type of attack and show that k-anonymity is not compromised by it. Non-homogeneous generalization can be used on top of any existing partitioning approach to improve its utility. In addition, we show that a new partitioning technique tailored for non-homogeneous generalization can further improve quality. A thorough experimental evaluation demonstrates that our methodology greatly improves the utility of anonymized data in practice.

Categories and Subject Descriptors
H.2.7 [Database Management]: Security, integrity, and protection

General Terms
Algorithms

Keywords
Non-homogeneous generalization, privacy, anonymization

Supported by grant HKU 71518E from Hong Kong RGC.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA. Copyright 2010 ACM.

1. INTRODUCTION
The problem of privacy-preserving data publishing has been extensively studied since it was first introduced in [2, 21]. Consider a large table which has to be released to the public for research purposes. Privacy is typically compromised by careless publishing of the table [3], since sensitive information may be leaked. Thus, the goal of data publishing is to transform the table, such that individuals may not be linked to specific tuples with high certainty. At the same time, the published data should still be useful, so an optimization problem arises: anonymize the data such that a certain degree of privacy is preserved while data utility is maximized.

In the table to be published, apart from the keys that are suppressed before publication, there is a set of attributes called the quasi-identifier (QID). The QID of each tuple is known to the attacker and may be used to identify an individual. A typical example of QID is {ZIP code, gender, date of birth}, which can uniquely identify 63% of the population in US Census data [8]. The popular k-anonymity principle [2, 21] requires that the probability of an adversary being able to find out the identity of an anonymized tuple is at most 1/k.

The most common technique for achieving k-anonymity is generalization [13, 14, 1, 6]. The table is divided into groups having k tuples or more and the QID values in each group are generalized to a range containing all original values. Table 2 shows an exemplary 2-anonymized table using generalization. The original data are shown in Table 1.
($t^*_i$ in Table 2 is the generalized version of $t_i$ in Table 1, for easy reference.) For example, the age of $t_3$ is originally 15 and, after generalization, it is replaced by the range 15-30. Apart from microdata publication, k-anonymity has been largely adopted in applications like location-based services [19, 11], to protect the identity of query issuers.

A wide range of algorithms using generalization have been proposed for addressing k-anonymity [1, 14, 6]. They share a common framework: first partition the tuples into groups, then assign the same generalized QID to tuples in the same group. The group of tuples with the same QID is called an equivalence class. Such an approach, to which we refer as homogeneous generalization, raises an important question: does generalization have to be homogeneous?

            QID                          Sens. attribute
Tuple ID    Zip code    Gender    Age    Disease
t1          [...]       M         30     Flu
t2          [...]       F         28     Cancer
t3          [...]       M         15     Cancer
t4          [...]       M         48     AIDS
t5          [...]       M         20     None
Table 1: Original table

Tuple ID    Zip code    Gender    Age      Disease
t*1         91***       *         15-30    Flu
t*2         91***       *         15-30    Cancer
t*3         91***       *         15-30    Cancer
t*4         923**       M         20-48    AIDS
t*5         923**       M         20-48    None
Table 2: 2-anonymity using homogeneous generalization

Tuple ID    Zip code    Gender    Age      Disease
t*1         9115*       *         28-30    Flu
t*2         91***       *         [...]    Cancer
t*3         91***       M         15-30    Cancer
t*4         923**       M         20-48    AIDS
t*5         923**       M         20-48    None
Table 3: 2-anonymity using non-homogeneous generalization

For example, consider the possible publication of Table 1, as shown in Table 3. $t^*_1$, $t^*_2$ and $t^*_3$ have different generalized QIDs. This generalization is non-homogeneous. Assuming the adversary knows the QIDs of all individuals contained in Table 1, he can find out the identity of any anonymized tuple in Table 3 with probability at most 1/2. Hence, 2-anonymity is satisfied, as is also the case for Table 2. On the other hand, if we compare the utility of the two tables, we can observe that Table 3 is better than Table 2, regardless of the utility measure used; for each tuple and QID attribute of Table 3, the generalized range is smaller than or equal to the corresponding range in the corresponding tuple and attribute in Table 2. This example shows that it is possible to achieve higher utility using non-homogeneous generalization.

The idea of non-homogeneous generalization was first introduced in [7], which studies techniques with a guarantee that an adversary cannot associate a generalized tuple to fewer than k individuals. However, the proposed solutions do not offer bounds for the probability of each association. Hence, some individuals may have higher probability to be associated to an anonymized tuple than others and this may lead to privacy breaches.

In this paper, we systematically study the use of non-homogeneous generalization in anonymizing tables. We provide a methodology for verifying whether a non-homogeneous generalization violates k-anonymity.
Then, we propose a technique that generates a non-homogeneous generalization and show that its result satisfies k-anonymity; however, by straightforwardly applying it, privacy can be compromised if the attacker additionally knows the anonymization algorithm. Based on this, we propose a randomization method that prevents this type of attack and show that k-anonymity is not compromised by it. Although non-homogeneous generalization can be used on top of any existing partitioning approach to improve its utility, we show that a new partitioning technique tailored for non-homogeneous generalization can further improve quality. Our main focus throughout the paper is k-anonymity; however, we also discuss how our methodology can be extended to improve utility for other privacy principles. A thorough experimental evaluation demonstrates that our methodology greatly improves the quality of anonymized data in practice.

The rest of the paper is organized as follows. The next section reviews related work and positions it against this paper. In Section 3, we formally define the k-anonymity problem and provide a partial ordering mechanism for comparing the utility of different anonymization results. Section 4 discusses the main challenges of non-homogeneous generalization and provides some properties that can be used to define a good generalization. Our methodology is described in Section 5. Section 6 discusses the extension of our methodology for l-diversity [16] and in Section 7 we experimentally evaluate it. Finally, Section 8 concludes the paper.

2. RELATED WORK
A privacy principle, k-anonymity, is developed in [2, 21] to guard against adversaries having the QIDs of individuals as background knowledge. The goal of k-anonymity is to prevent an adversary from identifying an individual with a probability higher than 1/k. Generalization and suppression are used to protect privacy. Generalization replaces the exact QID value by a less concrete form. For example, value 15 is generalized to range [15-30].
Suppression removes some values or the entire tuple from T.

The most popular method studied by the community for k-anonymity has been homogeneous generalization. The tuples in the table are partitioned into groups called equivalence classes. The QIDs of tuples in the same equivalence class are generalized to be the same. As finding the best partitioning that achieves k-anonymity while maximizing utility is NP-hard [18], different fast heuristics have been developed. These can be classified into two approaches: (i) global recoding (e.g., [14, 13]): if any two tuples have the same QID value, they must take the same generalized QID; (ii) local recoding (e.g., [1, 28, 6]): two tuples having the same QID may be generalized differently. Local recoding generally gives published tables of higher utility, due to its flexibility.

Apart from k-anonymity, there are other principles (e.g., [16, 26, 15, 25, 27, 24, 22, 17]) that target different privacy concerns and/or different adversary assumptions. The relation to be anonymized typically contains a sensitive attribute. Even if the adversary cannot associate the tuples in the published table with individuals (i.e., k-anonymity is satisfied), he may associate an individual to a particular sensitive value with high probability if there are multiple occurrences of the same sensitive value in the equivalence class where the QID of the individual belongs. For example, suppose Bob is a male of age 15 ($t_3$ in Table 1). Although there are 3 possible tuples {$t^*_1$, $t^*_2$, $t^*_3$} for Bob in Table 2, an attacker can derive that Bob is likely (with the high probability of 2/3) to have cancer. The l-diversity principle [16, 26] aims to bound the maximum of this inference probability to be 1/l. The t-closeness principle [15], on the other hand, aims to control the inference probability so that it is similar to the general distribution of the sensitive values. For example, if 90% of the population in T do not smoke, the goal is to ensure that 90% of individuals in each equivalence class are non-smoking.
Both l-diversity and t-closeness assume the same basic adversary capability: knowing the QID of individuals. Some works assume that an adversary may obtain additional background knowledge. For instance, [24] assumes the adversary may know the algorithm used in generalization. In [22], the adversary may corrupt some individuals, obtain their sensitive values, and use them to infer the remaining sensitive values in the equivalence class. Nevertheless, solutions to all the above problems may suffer from privacy breaches; [12] has demonstrated how to breach privacy using de Finetti's theorem.

Perturbation [1, 5] is another technique that has been used to preserve privacy in data publishing. Recent works also use a hybrid approach, combining perturbation and generalization to preserve privacy [22]. In perturbation, noise is added to the original data, such that the resulting values randomly deviate from the original ones. Compared to generalization, perturbation may introduce high error, especially for aggregate queries with small ranges. In addition, noise filtering techniques may be used to breach privacy [9].

The closest piece of work related to ours is [7], where non-homogeneous generalization is introduced. The principle of global (1, k)-anonymity is proposed, which guarantees that an individual is not associated to fewer than k generalized tuples. In addition, a generalization technique for global (1, k)-anonymity is developed. However, this work suffers from two major drawbacks. First, the principle does not ensure that an adversary associates an individual to at least k tuples with even probability. As a result, an anonymized tuple may have probability > 1/k to be associated with an individual; thus, global (1, k)-anonymity has a weaker privacy level compared to k-anonymity. Second, the proposed algorithm has a high complexity of O( 2.5 ) and is thus not suitable in practice.

In this paper, our goal is to develop a methodology for non-homogeneous generalization, which improves utility while maintaining an adequate level of privacy. Our study is mainly focused on the basic k-anonymity model, the reasons being that: (i) algorithms for k-anonymity are simple and many works (e.g., [16]) have adapted them for different principles; (ii) k-anonymity is commonly used in applications like location-based services [19, 11], where there are no additional (sensitive) attributes. Apart from the basic k-anonymity model, we also consider scenarios with stronger adversaries, who have knowledge of the generalization algorithm (Section 4.2) and of the identities of some generalized tuples (corruption, Section 5.2).

3. PROBLEM DEFINITION
Consider a relational table T, in which there are 3 classes of attributes: (i) attributes that are keys in T: such attributes are removed in the published table to prevent immediate identification of individuals; (ii) attributes that are part of the quasi-identifier (QID): the QID of every individual is known to the attacker as background knowledge and can be used to link tuples in the table to individuals; (iii) attributes that are not part of a key or QID: the values of such attributes are retained in the published table.
Our goal is to generate a publishable table $T^*$ such that (i) the k-anonymity privacy constraint is satisfied; and (ii) utility is maximized. In Section 3.1, we describe our assumptions about the adversary and define k-anonymity. In Section 3.2, we describe how the utility of different anonymized tables can be compared.

3.1 Adversary assumption and k-anonymity
We assume an adversary may obtain the value of QID and the identification of any tuple in T by sources other than $T^*$ (e.g., a public voters table). Let H be the adversary's knowledge containing the QID and identity of all known individuals. In the worst case, the adversary may have access to the QID of every individual; thus, by joining H and $T^*$ on QID, tuples $t^*$ in $T^*$ may be linked to individuals. k-anonymity aims at preventing the adversary from finding an individual's identity with a probability higher than 1/k.

DEFINITION 1. (Identity notion) Given two tables H, $T^*$, if a tuple $t_i \in H$ and a tuple $t^*_j \in T^*$ belong to the same individual, we say they have the same identity, denoted as $t_i =_I t^*_j$.

DEFINITION 2. (k-anonymity) Given a table T, assume that H is the projection of T on key and QID attributes. We say k-anonymity is preserved in an anonymized table $T^*$ if $\forall t_i \in H, \forall t^*_j \in T^*$, $\Pr(t_i =_I t^*_j) \le 1/k$.

3.2 Measuring and comparing utility
Measuring the utility of an anonymized table is usually done by means of an objective information loss measure that compares $T^*$ with T. Popular measures include the discernibility metric [4], which sums the squares of the equivalence class cardinalities, the normalized certainty penalty (NCP) [28], which is defined by the sum of QID attribute ranges in each equivalence class, and the global certainty penalty (GCP) [6], which is a normalized version of NCP. In Section 5.3, we provide a definition of GCP, which we use in this paper, as it affects the functionality of our data partitioning algorithm. In general, the utility of the anonymized data may not be easily captured by specific measures, as it depends on the application of the published data.
Our purpose is not to limit our study to a particular utility metric but to develop a new methodology, which generally improves the utility of existing methods that apply homogeneous generalization. As generalization converts precise data to uncertain data, a method that restricts the uncertainty of each tuple compared to the result of another method is certainly better. Definition 3 formally states when one anonymized table $T^*_a$ is strictly better than another $T^*_b$ in terms of utility (denoted by $T^*_a \succ T^*_b$); we can use it to define a partial order for anonymization results. In this paper, we aim at finding a local-optimal solution $T^*$, i.e., every $T^*_i$ such that $T^*_i \succ T^*$ violates k-anonymity.

DEFINITION 3. (utility-based ordering) Consider two anonymized tuples $t^*_1$, $t^*_2$. We say that $t^*_1$ preserves a better utility than $t^*_2$, denoted by $t^*_1 \succ t^*_2$, if for all attributes i in the QID, $t^*_1[i] \subseteq t^*_2[i]$, and there is at least one attribute j in the QID for which $t^*_1[j] \subset t^*_2[j]$; e.g., $\langle$9115*, *, 28-30$\rangle$ ($t^*_1$ in Table 3) $\succ$ $\langle$91***, *, 15-30$\rangle$ ($t^*_1$ in Table 2). Consider two anonymized tables $T^*_a$, $T^*_b$, both with n tuples, such that tuples $T^*_a[i]$ and $T^*_b[i]$ originate from $T[i]$, for all $i \in [1, n]$. We say $T^*_a$ preserves a better utility than $T^*_b$, denoted by $T^*_a \succ T^*_b$, if $\forall i \in [1, n]$, $T^*_a[i] \succ T^*_b[i]$ or $T^*_a[i] = T^*_b[i]$, and $\exists j \in [1, n]$ such that $T^*_a[j] \succ T^*_b[j]$.

As discussed earlier, k-anonymity can be achieved by first partitioning the data into groups and then uniformly transforming the QIDs of all records in the same group to take the same generalized value. Homogeneous generalization may not produce results of the highest possible quality. In the next section, we discuss the challenges of a non-homogeneous generalization technique that could be applied alternatively.

4. CHALLENGES IN NON-HOMOGENEOUS GENERALIZATION
Assume that the background knowledge of the adversary is a table H containing the QID of every individual.
Given a published table $T^*$, the adversary performs a linking attack by joining $T^*$ with H; for each tuple t in H, the adversary finds all tuples $t^*$ in $T^*$ such that t[QID] is included in the generalized $t^*$[QID]. For example, if the QID of $t \in T$ is 10 and there is a tuple $t^* \in T^*$ with $t^*$[QID] = [5-20], then $t^*$ is a possible generalization of t. We call the pair $\langle t, t^* \rangle$ a match. A valid assignment is a maximal 1-to-1 assignment between tuples of H and $T^*$.

In this paper, we identify two challenges to non-homogeneous generalization. In Section 4.1, we discuss the issue of ineffective matches when joining H and $T^*$, and how these can be identified and eliminated. In Section 4.2, we discuss a privacy threat for the case where the adversary knows the algorithm used to generate the anonymized table.

4.1 Pruning of ineffective matches
Intuitively, k-anonymity can be satisfied if an adversary finds that there are at least k matches related to the same tuple in H/$T^*$. This is the case for homogeneous generalization; given a generalized QID, any of the k or more tuples from the original table T that were grouped together and match that QID have the same probability (at most 1/k) to match any tuple in $T^*$ with that generalized QID. However, the same does not apply in the non-homogeneous case.

(a) original table        (b) anonymized table
Tuple ID    QID           Tuple ID    QID
t1          1             t*1         [...]
t2          2             t*2         [...]
t3          3             t*3         [...]
t4          4             t*4         [...]
t5          5             t*5         [...]
Table 4: Non-homogeneous generalization

Consider the original table T shown in Table 4a and an anonymized table $T^*$ using non-homogeneous generalization, as shown in Table 4b. For every $t_i$ in T, there are at least two matches in $T^*$ and vice versa. However, 2-anonymity is not satisfied. Both $t_1$ and $t_5$ in T match $t^*_1$ and $t^*_5$ in $T^*$. $t_2$ matches $t^*_1$, $t^*_2$ and $t^*_5$ in $T^*$. Since $t^*_1$ and $t^*_5$ must be either $t_1$ or $t_5$, $t_2$ can only be matched to $t^*_2$ in a valid assignment, violating 2-anonymity. We say that the match of $t_2$ to $t^*_1$ (and $t^*_5$) is ineffective if an adversary can eliminate such a possibility.

DEFINITION 4. (Match and assignment) Given a table T and its anonymized table $T^*$, a match m is a 2-tuple $\langle t_i, t^*_j \rangle$ where $t_i \in T$, $t^*_j \in T^*$ and the QID of $t_i$ is included in that of $t^*_j$. An assignment a is a set of matches $m_i = \langle t_{x_i}, t^*_{y_i} \rangle$, where $t_{x_i} \in T$, $t^*_{y_i} \in T^*$, $|a| = |T|$, and for each pair of distinct matches $m_i \in a$, $m_j \in a$, $x_i \ne x_j$ and $y_i \ne y_j$.

DEFINITION 5. (Effective match) Given a table T and its anonymized table $T^*$, let H be the projection of T on key and QID attributes. Given two tuples $t_i \in H$, $t^*_j \in T^*$, a match $m = \langle t_i, t^*_j \rangle$ is said to be effective if and only if there exists an assignment a such that $m \in a$.

An assignment represents the scenario in which an adversary gives a unique identity to every anonymized tuple in $T^*$ and vice versa. If a match cannot be found in any of the assignments, then the adversary can discard this match. If all ineffective matches are removed and there are fewer than k matches left for a tuple in H, k-anonymity is violated. In the following, we first discuss how to determine if a match is effective (Section 4.1.1).
Then, we present the property that the generalized table should satisfy in order for all matches to be effective (Section 4.1.2).

4.1.1 Necessary condition for effective match
In order to satisfy k-anonymity, we must have at least k effective matches for each tuple in T. In order to determine if a match is effective, we use an assignment graph, which visualizes the matches.

DEFINITION 6. (Assignment graph) Consider a table T and its anonymized table $T^*$. Assume that the tuples in both tables are ordered, such that the assignment $a = \{\langle t_i, t^*_i \rangle\}$ is valid, for all $t_i \in T$, $t^*_i \in T^*$. An assignment graph G = (V, E) is a directed graph with $|T|$ vertices. For i = 1 to $|T|$, $n_i \in V$ represents $t_i \in T$ and $t^*_i \in T^*$. An edge $n_i \to n_j$ is present if and only if the QID of $t_i$ is included in that of $t^*_j$.

Figure 1 shows the assignment graph constructed for Table 4b. Each edge in the assignment graph represents a match that can be found by joining T and $T^*$. For example, the edge from $n_3$ to $n_4$ means that $t_3$ in T joins to $t^*_4$ in $T^*$. Next we show how to verify the effectiveness of a match in the assignment graph.

Figure 1: Assignment graph of Table 4b

THEOREM 1. Consider a table T, its anonymized table $T^*$, and the corresponding assignment graph G = (V, E). The match $\langle t_i, t^*_j \rangle$ (corresponding to edge $(n_i, n_j) \in E$) is effective if and only if $n_i$ is reachable from $n_j$.

PROOF. If part. Note that if a match is not effective, we cannot find an assignment containing the match. So, we can prove the statement by showing how to construct an assignment $a'$ that contains the target match $\langle t_i, t^*_j \rangle$. Without loss of generality, we assume j > i and that a path from $n_j$ to $n_i$ is $\{n_j, n_{j-1}, \ldots, n_{i+1}, n_i\}$. Note that a possible assignment is $a = \{\langle t_i, t^*_i \rangle\}$. For x > j and x < i, the match $\langle t_x, t^*_x \rangle$ is added to assignment $a'$. For x = j down to i + 1, since there is an edge from $n_x$ to $n_{x-1}$, we can add $\langle t_x, t^*_{x-1} \rangle$ to $a'$. Finally, we include our target match $\langle t_i, t^*_j \rangle$ in $a'$. Hence, an assignment containing the target match is produced.

Only if part.
Given an effective match $\langle t_i, t^*_j \rangle$, there is an assignment a that contains this match. Each match in a is an edge in G, thus a is a subset of E. We now show that we can find a path from $n_j$ to $n_i$ using the matches in a. Consider a subgraph $G' = (V, a)$ that only contains matches in a. Each node in $G'$ has exactly one outgoing edge and one incoming edge. Hence, $G'$ must be composed of cycles. The match $\langle t_i, t^*_j \rangle$ is represented by the edge $(n_i, n_j)$, which lies on a cycle as well. So, $n_i$ is reachable from $n_j$ by traveling through the cycle.

For example, in Figure 1, $n_2$ is not reachable from $n_1$. This means the match $\langle t_2, t^*_1 \rangle$ is not effective and cannot appear in a valid assignment. Thus, all ineffective matches can be identified and removed from the assignment graph by the adversary. This results in a reduced assignment graph. Figure 2 shows the graph derived from the initial one shown in Figure 1, containing only the effective matches. This gives a clearer picture of why Table 4b does not satisfy 2-anonymity, as $t_2$ can only be mapped to $t^*_2$ in a valid assignment.

Figure 2: Effective matches in the graph of Table 4b
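The reachability test of Theorem 1 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, QIDs are single integers, tuples are indexed from 0, and the generalized ranges in `RANGES` are assumed values chosen only to be consistent with the matches described for Table 4b.

```python
from collections import deque

# Original QIDs of Table 4a (t_1..t_5) and ASSUMED generalized ranges
# (hypothetical values, picked to reproduce the matches described in the text).
QIDS = [1, 2, 3, 4, 5]
RANGES = [(1, 5), (2, 3), (3, 4), (3, 4), (1, 5)]


def assignment_graph(qids, ranges):
    """Edge i -> j iff the QID of t_i is included in the range of t*_j."""
    n = len(qids)
    return {
        i: [j for j in range(n) if ranges[j][0] <= qids[i] <= ranges[j][1]]
        for i in range(n)
    }


def reachable_from(graph, start):
    """Set of nodes reachable from `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def effective_matches(graph):
    """Keep edge (i, j) only if n_i is reachable from n_j (Theorem 1)."""
    reach = {u: reachable_from(graph, u) for u in graph}
    return {i: [j for j in graph[i] if i in reach[j]] for i in graph}


def satisfies_necessary_condition(graph, k):
    """Necessary (not sufficient) condition: k effective matches per tuple."""
    return all(len(js) >= k for js in effective_matches(graph).values())


GRAPH = assignment_graph(QIDS, RANGES)
EFFECTIVE = effective_matches(GRAPH)
```

With these assumed ranges, `EFFECTIVE[1]` (tuple $t_2$) contains only index 1, so `satisfies_necessary_condition(GRAPH, 2)` is false. Running one search per node keeps the sketch simple; a strongly connected components pass would give the same pruning more efficiently.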

4.1.2 Impact of effective match on generalization
From Theorem 1, we know that the effectiveness of a match can be determined by looking at the connectivity of nodes in a graph. In fact, if we keep only effective matches, the graph will degenerate to the set of its strongly connected components.

THEOREM 2. Consider a table T, its anonymized table $T^*$, and the corresponding assignment graph G = (V, E). If all matches are effective, G is a set of strongly connected components, such that there are no edges between any two components.

PROOF. A graph can always be decomposed into a number of strongly connected components. We prove the theorem by showing that each component in G is independent, i.e., there is no edge between any two components. We prove the statement by contradiction. Without loss of generality, we assume $C_1$ and $C_2$ are two components in G and there is a path from u to v where $u \in C_1$ and $v \in C_2$. Thus any node $x \in C_1$ can reach any node $y \in C_2$ via the path from u to v. Since there are effective matches only, there must be a path from v to u due to Theorem 1. Hence, from every node $y \in C_2$, we can reach any node $x \in C_1$. $C_1$ and $C_2$ are then a single strongly connected component, which contradicts the assumption.

Theorem 2 leads to an interesting observation: tuples are partitioned into strongly connected components in a non-homogeneous generalization. Note that the complexity of finding the strongly connected components is linear in the number of edges in G, due to Tarjan's algorithm [23].

4.2 Randomization in generalization
With non-homogeneous generalization, the generalized QID of tuples in a partition (i.e., equivalence class) may vary. For example, in Table 3, $t^*_1$, $t^*_2$ and $t^*_3$ each have a unique generalized QID. This offers additional information to an adversary in his quest for the identities of the anonymized tuples. If we use a deterministic non-homogeneous generalization approach, the generalized value of each tuple in the table would be the same for every possible run of such a method.
Therefore, if the adversary knows the generalization algorithm, he can apply it on H, compare the result with the anonymized table, and infer the original QID of the anonymized tuples and therefore their identities. Due to this problem, randomization is necessary when anonymizing a table with non-homogeneous generalization.

A good randomization technique implies that when an adversary finds k tuples in H joined with an anonymized tuple $t^*_i$ in $T^*$, the probability of each of these tuples being the real identity of $t^*_i$ is the same (= 1/k). We can achieve this goal by first computing the generalized QID of tuples deterministically and then assigning each generalized QID to a tuple in T in a randomized way. Figure 3 is an example, illustrating this process for 2-anonymity. First, we generate for each original tuple $t_i$ in T a generalized QID $t^*_i$, which contains $t_i$ and k - 1 additional ones from T. The QID generation function, denoted by gen, takes as input a set of tuples and gives a generalized QID range. Next, for each generalized QID, we assign a random identity to it with a probability of 1/k. In the example, we have picked $t_2 =_I t^*_1$, $t_3 =_I t^*_2$, and $t_1 =_I t^*_3$. The other attributes are copied to the anonymized table accordingly.

Thus, the generalization procedure is divided into two steps: (i) generalized QID generation; (ii) random assignment generation. The QID generation determines whether k-anonymity can be achieved and affects the possible assignments that we can choose from in the randomization.
For example, if the QID generation for Table 4a is done as shown in Table 4b, it is not possible to achieve k-anonymity, as we have shown in Section 4.1. In the following subsection, we will discuss how we can guarantee k-anonymity in the QID generation process.

Table T                       QID generation                              Anonymized table T*
ID    QID          Other                                                  ID    QID                  Other
t1    <1, 1, 2>    a          gen({t1, t2}) = <1, 1-2, 1-2>               t2    <1, 1-2, 1-2>        b
t2    <1, 2, 1>    b          gen({t2, t3}) = <1-2, 1-2, 1>               t3    <1-2, 1-2, 1>        c
t3    <2, 1, 1>    c          gen({t1, t3}) = <1-2, 1, 1-2>               t1    <1-2, 1, 1-2>        a
Figure 3: The generalization process in achieving 2-anonymity

4.2.1 A sufficient condition for k-anonymity
In order to preserve k-anonymity, we have already shown that a necessary condition is to have at least k effective matches for each tuple. However, this condition alone cannot guarantee k-anonymity. Consider the assignment graph shown in Figure 4. For simplicity, self-loops are omitted and reciprocal edges connecting the same pair of nodes are merged into a single bidirectional edge. $K_{3a}$ and $K_{3b}$ are complete graphs of 3 nodes. Note that every edge in the graph represents an effective match and every node has at least three incoming and outgoing edges. However, 3-anonymity cannot be achieved by randomization on top of this assignment graph. For an edge $n_1 \to n_i$ where $i \ne 1$, the path from $n_i$ to $n_1$ must go through edge $n_5 \to n_6$. So, if $t_1 \ne_I t^*_1$, we know that $t_5 =_I t^*_6$. From this, we can draw the conclusion that either $t_1 =_I t^*_1$ or $t_5 =_I t^*_6$ is true (a probability of 1/2 to breach privacy).

Figure 4: An example of an assignment graph that has 3 effective matches for each tuple but violates 3-anonymity (self-loops are removed for simplicity)

In the above example, each of the possible assignments contains either the match $\langle t_1, t^*_1 \rangle$ or $\langle t_5, t^*_6 \rangle$. Thus, we can find at most 2 assignments with no overlapping matches. In fact, if there are k = 3 assignments with no overlapping matches, we can achieve k-anonymity. First, we define the concept of the match-different condition.

DEFINITION 7. (match-different assignments) Given two assignments $a_i$, $a_j$, $a_i$ is match-different to $a_j$ if $a_i \cap a_j = \emptyset$.

Having k match-different assignments, we can randomly pick one of them as the resulting assignment of the randomization process. For each anonymized tuple $t^*_j$, there are k different possible identities in total. Each identity of $t^*_j$ is in a different match-different assignment. Since each assignment has the same chance to be picked, $t^*_j$ is assigned to a particular tuple in T with the same 1/k chance; hence, k-anonymity can be achieved. To ensure that there are k match-different assignments in the set of generalized QIDs, we prove that it is sufficient that, in the assignment graph with only effective matches, each node has the same number (k) of incoming edges and outgoing edges.

LEMMA 1. Consider an assignment graph with only effective matches, where each node has k outgoing edges and k incoming edges. Given d match-different assignments $a_1, a_2, \ldots, a_d$, where 1 < d < k, we can always find an assignment $a_{d+1}$ such that $a_{d+1}$ is match-different to $a_i$ for i = 1 to d.

PROOF. See Appendix A.

5. ANONYMIZATION USING NON-HOMOGENEOUS GENERALIZATION
In this section, we discuss how we can generate a k-anonymized table using non-homogeneous generalization, building on the observations from the previous section. Although we can apply non-homogeneous generalization directly on T, due to the large scale of the data, the high cost of the necessary randomization, and the natural partitions that possibly exist in the data, we first partition the data into groups of k or more tuples and then apply non-homogeneous generalization to each group. In a nutshell, we follow this framework:

1. Divide the tuples into partitions
2. Generalize the QID of each tuple in each partition
3. Assign generalized QIDs to tuples, based on a random assignment

We first discuss how we generalize the QID of each tuple (step 2) in Section 5.1. Then, we explain our randomization technique (step 3) in Section 5.2.
Finally, we outline our partitioning method (step 1) in Section 5.3.

5.1 Ring generalization
Assuming that the data are partitioned, non-homogeneous generalization should be applied to each partition. In fact, we need to determine, for each tuple, which k - 1 other tuples will be included in the generalization. Then, we can extend the QID of a tuple to include the QIDs of the other tuples. Let gen be a QID generalization function that takes as input a set of tuples and returns a generalized range on QID. For example, considering Table 4a, gen({$t_2$, $t_4$, $t_5$}) = [2-5]. Let $PS(t_i)$ be the set of tuples in T that is used to produce a generalized QID for $t_i$. In the above example, $PS(t_2)$ = {$t_2$, $t_4$, $t_5$}. Based on Lemma 1, we should have $|PS(t_i)| = |PS(t_j)|$ for all i, j. In order to minimize information loss in the generalized QIDs and satisfy k-anonymity, we design a generalization with $|PS(t_i)| = k$ for all $t_i$.

Consider the set of tuples in a partition P and assume that the tuples are ordered as $t_1, t_2, \ldots, t_{|P|}$. An easy way to construct $PS(t_i)$ is to assign the consecutive tuples, gen($t_i, t_{i+1}, \ldots, t_{i+k-1}$), to $t_i$. (Note that if $i + j > |P|$, we use $i + j - |P|$ instead.) We call this ring generalization, as the assignment graph resulting from it looks like a ring. Figure 5 illustrates the ring generalization for a partition with 5 tuples and k = 3. The upper-left graph in Figure 6 is the corresponding assignment graph.

Tuple in T    PS(t_i)        Tuple in T*
t1            t1 t2 t3       t*1 = gen(t1, t2, t3)
t2            t2 t3 t4       t*2 = gen(t2, t3, t4)
t3            t3 t4 t5       t*3 = gen(t3, t4, t5)
t4            t4 t5 t1       t*4 = gen(t1, t4, t5)
t5            t5 t1 t2       t*5 = gen(t1, t2, t5)
Figure 5: Ring generalization for a partition with 5 tuples and k = 3

Figure 6: Generating three random assignments (panels: ring generalization, first assignment, second assignment, third assignment)

Note that every match in the ring generalization is effective, as it is part of a cycle.
In addition, since each node in the assignment graph represented by ring generalization has $k$ incoming edges and $k$ outgoing edges, we can find $k$ match-different assignments, and $k$-anonymity can be assured by randomly picking one of them as the actual assignment (Lemma 1). The ring generalization is a locally optimal solution, because we cannot remove any more edges from the graph. In addition, we can easily show that it gives equal or better utility compared to any homogeneous generalization. If the partition size is $k$, the ring generalization degenerates to a homogeneous generalization, where all original tuples in the group match with all generalized tuples. If the partition size $|P|$ is greater than $k$, in the ring generalization every tuple will match a set $S$ of $k$ tuples, as opposed to $P$ in the homogeneous case. Since $S \subset P$, and all utility metrics are monotonic to subset relationships, only better utility can be achieved by the ring generalization.

LEMMA 2. Ring generalization gives a $k$-anonymized partition of equal or better utility than that given by a homogeneous generalization on the same partition.

An additional benefit of this generalization is that, given a proper ordering of the tuples (e.g., using Hilbert curves), the values in each generalized QID will be close to each other with high probability, maximizing the utility gain compared to a homogeneous generalization.
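The ring construction of this subsection can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `ring_generalization` is hypothetical, tuples are assumed to be already ordered within the partition, and each generalized QID is represented as one set of values per attribute rather than as a range.

```python
def ring_generalization(qids, k):
    """Hypothetical helper: for each position i, build a generalized QID from
    the k consecutive tuples i, i+1, ..., i+k-1 (indices wrap around)."""
    n = len(qids)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= partition size")
    generalized = []
    for i in range(n):
        members = [qids[(i + j) % n] for j in range(k)]
        # Assumption of this sketch: one value set per QID attribute.
        generalized.append(tuple(frozenset(col) for col in zip(*members)))
    return generalized


# A partition with 5 one-attribute tuples and k = 3, as in Figure 5.
partition = [(1,), (2,), (3,), (4,), (5,)]
ring = ring_generalization(partition, 3)
for i, g in enumerate(ring, start=1):
    print(f"t'{i} covers", sorted(g[0]))
```

With `k` equal to the partition size, every generalized QID covers the whole partition, which corresponds to the homogeneous case discussed above.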

5.2 Randomization

In this section, we describe how we randomly assign each generalized QID to a tuple in $T$ (and replace the original QID of the tuple by the generalized one). An intuitive idea is to generate all possible assignments and pick one uniformly at random. Unfortunately, such an approach may violate $k$-anonymity. For example, in a partition with 5 tuples, the generalized QIDs using ring generalization for 3-anonymity will have 13 different possible assignments. Note that 13 is not divisible by 3, meaning that some matches are contained in more assignments than other matches. This allows the adversary to infer these matches with probability higher than $1/k$. In the example of Figure 5, matches $\langle t_i, t'_{i-1} \rangle$ have higher probability (5/13) than matches $\langle t_i, t'_i \rangle$ and $\langle t_i, t'_{i-2} \rangle$ (with probability 4/13).

As discussed in Section 4.2.1, the solution is to define $k$ match-different assignments and randomly select one of them. One easy construction of $k$ match-different assignments is to set $a_j = \{\langle t_i, t'_{i-j} \rangle\}$ for $j = 0$ to $k-1$. (Note that if $i - j < 1$, we use $i - j + |P|$ instead.) Consider the example in Figure 5, and assume that we pick each column of $PS(t_i)$ as an assignment. If one of these assignments is set as the real assignment at random, each column is chosen with probability $1/k$. Thus the probability $\Pr(t_i =_I t'_{i-j})$ is $1/k$ and $k$-anonymity is preserved.

However, if we apply this approach, privacy can easily be compromised when the adversary knows the identity of one anonymized tuple as background knowledge. In practice, such background knowledge can easily be acquired. For example, using the generalization of Figure 5, if an adversary knows that $t_3 =_I t'_2$, he knows that the second column is the real assignment. Hence, he can find out the identities of all anonymized tuples, e.g., $t_2 =_I t'_1$. This type of attack is called corruption and has been studied in [22].

In order to increase the resistance to corruption, the match-different assignments are defined randomly. The generation process shares a similar framework with the proof of Lemma 1.
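The count of 13 assignments and the 5/13 versus 4/13 skew quoted at the start of this subsection can be checked by brute force. The sketch below is an illustration only: the function name `ring_assignments` is hypothetical, tuples are indexed from 0, and an assignment is modelled as a permutation `sigma` in which original tuple `i` receives the generalized QID with index `sigma[i]`.

```python
from collections import Counter
from itertools import permutations


def ring_assignments(n, k):
    """Enumerate all assignments allowed by ring generalization: tuple i may
    only receive a generalized QID i - j (mod n) with 0 <= j <= k - 1."""
    return [
        sigma
        for sigma in permutations(range(n))
        if all((i - sigma[i]) % n < k for i in range(n))
    ]


assignments = ring_assignments(5, 3)
# For tuple 0, count how many assignments use each offset j.
offset_counts = Counter((0 - sigma[0]) % 5 for sigma in assignments)

print(len(assignments))               # number of possible assignments
print(sorted(offset_counts.items()))  # assignments per offset j
```

Picking uniformly among all enumerated assignments therefore favors the offset-1 matches, whereas picking among $k$ match-different assignments gives every match the same probability $1/k$.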
The pseudocode of an algorithm that generates a random assignment is shown in Figure 7. In the following, we briefly describe the basic idea of the algorithm. The algorithm is run $k$ times to generate the $k$ assignments. At each run, it operates on the set of matches $M$ that are not present in assignments generated in previous runs. It tries to find a set of cycles in the graph that cover all nodes by random walks and uses them to define an assignment. The cycles are found incrementally, starting from an unassigned node. After a cycle has been found, all its nodes are marked as processed and the search for a new cycle starts, until all nodes are processed, in which case the assignment is committed and returned. Cycles are not directly committed in the assignment once found, because some of them, when removed, may result in a graph where no cycle exists. Thus, while finding a new cycle, matches set by previous cycles may change. In addition, we limit each node to be visited at most once in each random walk (by remembering the nodes traveled in $U$). The algorithm backtracks when it reaches a dead end. Lemma 1 guarantees correctness and termination.

Figure 6 exemplifies three runs of the algorithm on the ring generalization of Figure 5, shown at the upper left of Figure 6. The solid edges show the matches that are chosen for the current assignment. For example, the first assignment, containing cycles $n_1 \to n_4 \to n_2 \to n_1$, $n_3 \to n_3$, and $n_5 \to n_5$, will assign $t_1$ to $t'_4$, $t_4$ to $t'_2$, $t_2$ to $t'_1$, $t_3$ to $t'_3$, and $t_5$ to $t'_5$. After the first run, the corresponding edges are removed and the algorithm is run again to generate the second assignment.

Input: A partition of tuples $P$; a set of anonymized tuples $Q$; a set of possible matches $M$ (excluding matches already used in other assignments).
Output: An assignment $a \subseteq M$.
1. $a = \{\langle t_i, t'_i \rangle\}$ // initial assignment (possibly invalid)
2. $L = P$ // L represents the set of unprocessed nodes
3. While ($L \neq \emptyset$)
4.   // pick an edge to start a loop
5.   Pick $t_i \in L$ at random
6.   Pick $t'_j \in Q$ randomly such that $\langle t_i, t'_j \rangle \in M$
7.   $U = \{t'_j\}$ // U remembers the nodes traveled
8.   While ($t_i \notin U$) // find a cycle by random walk
9.     // $t_i$ is assigned to $t'_j$, so the one that is assigned
10.    // to $t'_j$ before has to find another pair
11.    Select $t_x$ where $\langle t_x, t'_j \rangle \in a$
12.    Pick $t'_y \notin U$ randomly where $\langle t_x, t'_y \rangle \in M$
13.    if such $t'_y$ does not exist
14.      $t'_j$ = $t'_j$'s parent // backtracking
15.    else
16.      Add $t'_y$ to $U$ and set $t'_j$ as $t'_y$'s parent
17.  End while
18.  // a loop is found
19.  update $a$ and remove nodes in the loop from $L$
20. End while
21. return $a$

Figure 7: Algorithm for generating a random assignment

Regarding corruption, ring generalization gives a $(k-1)$-vertex-connected assignment graph, i.e., the graph is still connected after fewer than $k-1$ nodes in the graph are removed. Assume an adversary obtains background knowledge about the identity of a set of anonymized tuples $Q$, belonging to the same partition. The adversary can remove the corresponding nodes from the assignment graph. If $|Q| < k-1$, the assignment graph is still connected, i.e., all matches are still effective. There are at least $k - |Q|$ outgoing edges and $k - |Q|$ incoming edges for each remaining node. So, there are at least $k - |Q|$ possible identities for each anonymized tuple. Due to randomization, an adversary cannot find out the actual identity of an anonymized tuple. Therefore, under corruption, non-homogeneous generalization with randomization offers privacy protection similar to that of homogeneous generalization under $k$-anonymity.

5.2.1 Cost analysis and optimizations

The cost of randomization for a partition $P$ is dominated by generating the $k$ match-different assignments. When generating a new assignment, we maintain a list of unprocessed nodes $L$ (line 2). Then, we find a cycle in the assignment graph which starts with a node in $L$ by a depth-first random walk. Note that each node can be visited at most once. The complexity for that is $O(|V| + |E|)$, where $|V|$ is the number of nodes and $|E|$ is the number of edges in the graph. $|V| = |P|$ and $|E| = k|P|$ in the assignment graph. Hence, it takes $O(k|P|)$ to generate a random cycle. Note that each new cycle contains at least one node in $L$.
In the worst case, we need $|P|$ iterations to assign every node in $L$. So, the overall cost to generate a match-different assignment is $O(k|P|^2)$. Since $|P| \geq k$, randomization becomes expensive for large values of $k$, or when the partitions are very large compared to $k$. Therefore, a good partitioning strategy should avoid generating huge groups. We now describe two simple optimizations to reduce this cost in practice. In Section 7, we experimentally evaluate the cost of the optimized randomization algorithm and show that it is bearable in practical cases, as it only depends on the size of the partitions and not the database size.

Reducing the number of generated match-different assignments. In the randomization process, we first generate $k$ match-different assignments, and then choose one of them randomly. Let $a_1, a_2, \ldots, a_k$ be the generated assignments in order. Since which assignment will be picked is independent of the generation process, we can first determine which $a_i$ of the $k$ assignments will be picked to be the real assignment and then generate only up to the $i$-th assignment. This will, on average, halve the randomization cost. Note that the assignments before the $i$-th should still be generated, as they determine which edges remain at the time of the generation of the $i$-th assignment. Always generating and picking the first assignment would result in the selection of some matches with higher probability and is not acceptable, as discussed in the beginning of Section 5.2.

Using a random permutation to generate the initial assignment. The goal of the algorithm in Figure 7 is to generate a random match-different assignment. A fast Monte Carlo way is to use a random permutation. The resulting permutation may not be a valid assignment, because some of the matches may not be in the set of possible matches $M$. However, some matches in the permutation may be valid. We use this permutation as the initial assignment of the algorithm (line 1 in Figure 7). $L$ is initialized to be the set of anonymized tuples that do not have a valid match. This reduces the initial size of $L$ and hence the computational cost of assignment generation.

5.3 Partitioning

In this section, we discuss how to choose a good partitioning strategy for non-homogeneous generalization. Before this, we provide an appropriate measure for utility, which we adopt from previous work. Based on our discussion so far, value ranges are used to define a generalized QID of tuples. For example, for a QID which is generated by the three values 15, 20, 48, we use range [15-48]. This format is compact and easy to use in data analysis; however, it introduces some unnecessary information loss, as values within the range but not present in the generating set of values are included in it.
For example, value 17 is implicitly included in the generalized range [15-48]. In fact, the QID generalization model that has the minimum information loss is the set representation, e.g., the set {15, 20, 48} is used as the generalized QID. The set representation offers a significant improvement in utility and it is also more general, as it is appropriate for both ordered and nominal attributes. Therefore, we adopt it in this paper and use it in subsequent discussions. In addition, we use the Global Certainty Penalty (GCP) [6] as a measure for utility.

DEFINITION 8. (metric - GCP) Let $t'_i$ be an anonymized tuple in anonymized table $T'$ using set representation. Let $A$ be a QID attribute, $|A|$ be the cardinality of $A$, and $count_A(t'_i)$ be the number of distinct values of $A$ in $t'_i$. The normalized certainty penalty NCP of $t'_i$ on attribute $A$ is

$$NCP_A(t'_i) = \frac{count_A(t'_i) - 1}{|A| - 1}.$$

For the whole table $T'$,

$$NCP(T') = \sum_{A \in QID} \sum_{t'_i \in T'} NCP_A(t'_i).$$

Finally, $GCP(T')$ is defined as

$$GCP(T') = \frac{NCP(T')}{d \cdot |T'|},$$

where $d$ is the number of attributes in the QID.

As discussed, similar to the homogeneous case, for scalability reasons we should divide the tuples of $T$ into partitions before applying non-homogeneous generalization to each of them. One option is to use an off-the-shelf partitioning method for homogeneous generalization (e.g., [14]) and then apply our ring generalization at each partition. However, existing partitioning strategies may not be the most appropriate, as they do not take into account the use of non-homogeneous generalization. The main difference between homogeneous and non-homogeneous generalization is that, in the former, it is always better to divide a large group into two. For example, consider a set of four QIDs $t_1 = \langle 1, 1, 1 \rangle$, $t_2 = \langle 1, 1, 2 \rangle$, $t_3 = \langle 2, 1, 2 \rangle$, $t_4 = \langle 1, 2, 1 \rangle$ and assume the domains of all QID attributes are the same, with cardinality $|A|$. Suppose that we put $t_1, t_2$ into group $G_1$ and $t_3, t_4$ into another group $G_2$. $t_1$ and $t_2$ differ in one value, whereas $t_3$ and $t_4$ differ in three values. Thus, with this grouping the value of NCP is $\frac{2 \cdot 1 + 2 \cdot 3}{|A| - 1} = \frac{8}{|A| - 1}$.
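As a check on Definition 8, the NCP of the two-group partitioning above can be computed directly. The following is a minimal sketch under stated assumptions: the function names `ncp`, `gcp` and `homogeneous` are hypothetical, generalized QIDs are tuples of value sets, and the common domain size is assumed to be 5 purely for illustration (the text leaves $|A|$ symbolic).

```python
from fractions import Fraction


def ncp(generalized, domain_sizes):
    """NCP(T'): sum over tuples and attributes of (count - 1) / (|A| - 1)."""
    total = Fraction(0)
    for qid in generalized:
        for values, size in zip(qid, domain_sizes):
            total += Fraction(len(values) - 1, size - 1)
    return total


def gcp(generalized, domain_sizes):
    """GCP(T') = NCP(T') / (d * |T'|), with d the number of QID attributes."""
    return ncp(generalized, domain_sizes) / (len(domain_sizes) * len(generalized))


def homogeneous(group):
    """Give every tuple of the group the same set-valued generalized QID."""
    shared = tuple(frozenset(col) for col in zip(*group))
    return [shared] * len(group)


t1, t2, t3, t4 = (1, 1, 1), (1, 1, 2), (2, 1, 2), (1, 2, 1)
sizes = (5, 5, 5)  # hypothetical domain size |A| = 5 for every attribute

grouped = homogeneous([t1, t2]) + homogeneous([t3, t4])
print(ncp(grouped, sizes))  # 8 / (|A| - 1) with |A| = 5
print(gcp(grouped, sizes))
```

Exact fractions are used so that the result can be compared with the symbolic value $8/(|A|-1)$ without rounding.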
If we apply ring generalization (with $k = 2$) directly on the four tuples, without partitioning, the value of NCP is $\frac{1 + 1 + 3 + 1}{|A| - 1} = \frac{6}{|A| - 1}$. Hence, we obtain a higher information loss after we partition the tuples. In the non-homogeneous case, even if the partition size is much larger than $k$, each generalized QID is generated from exactly $k$ tuples. Thus, low information loss can still be achieved in large partitions, as opposed to homogeneous generalization, which suffers from high information loss if the size of partitions is large. On the other hand, the size of the partitions affects the cost of randomization in our approach (see Section 5.2.1), therefore it should be controlled.

5.3.1 Partitioning based on lexicographical order

We now discuss our partitioning strategy for non-homogeneous generalization. As discussed, we consider a set representation for the QIDs, i.e., in an anonymized tuple $t'_i$ each QID attribute takes the set of values of that attribute in the tuples that generate the QID. Hence, the distinct count $count_A(t'_i)$ of $A$'s values in $t'_i$ is less than or equal to $k$. According to the NCP measure, if the domain size of $A$ is small, we lose more information if more than one value of $A$ exists in a generalized tuple. For example, consider attributes sex and birthday, whose domains contain 2 and several hundred values, respectively. If we put two tuples with different sex values in a group, the NCP for that attribute in the group will be maximized, but if we put two tuples with different birthdays in a group, the introduced NCP error is small. Thus, during partitioning, we should prioritize the reduction of $count_A(t'_i)$ for attributes with small domains.

To achieve this goal, we order the attributes according to their domain size; then the tuples in $T$ are sorted in lexicographical QID order, based on the attribute ordering. The tuples are partitioned in a top-down fashion. First, we consider attribute $A_1$ with the smallest domain size and put tuples with the same $A_1$-value in the same partition. This results in a set of partitions $P_1, P_2, \ldots,$
$P_m$, such that the NCP of $A_1$ will be 0 in all partitions. However, some partitions may have fewer than $k$ tuples. For each such partition $P_j$, we find a neighboring partition $P_x$, either $P_{j-1}$ or $P_{j+1}$, that can either be merged with $P_j$ or give some of its tuples to $P_j$, so that both of them have at least $k$ tuples. If $|P_x| + |P_j| < 2k$, we merge $P_x$ with $P_j$; otherwise, we move tuples from $P_x$ to $P_j$ such that $|P_j| = k$. After we are done with $A_1$, we recursively partition the resulting groups using the next attribute in order (i.e., $A_2$). In some partitions, the tuples may have different $A_1$ values (due to merging). For such partitions, we do not attempt to further decompose them recursively using another attribute. The partitioning strategy is repeated until all partitions are finalized or there are no more attributes that can be used for recursive partitioning. Figure 8 shows the pseudocode of the partitioning algorithm. On each finalized partition, we apply non-homogeneous generalization, as explained in Sections 5.1 and 5.2.

Input: a set of tuples $P$; parameter $k$ in $k$-anonymity; attribute used for partitioning $A_i$.
Precondition: (i) attributes in the QID are sorted in ascending order of domain size $A_1, A_2, \ldots, A_{|QID|}$; (ii) tuples are sorted in lexicographical order according to the attribute order; (iii) all tuples in $P$ have the same value on $A_1, A_2, \ldots, A_{i-1}$.
1. // minimize the uncertainty for attribute $A_i$
2. Partition $P$ into $P_1, P_2, \ldots, P_m$ using $A_i$
3. // if there are not enough tuples in a group
4. For each $P_j$ where $|P_j| < k$
5.   Find $P_x$ as a neighboring partition of $P_j$
6.   If $|P_j| + |P_x| \geq 2k$
7.     move $k - |P_j|$ tuples from $P_x$ to $P_j$
8.   Else
9.     Merge $P_j$ and $P_x$
10. End for
11. For each $P_j$
12.   If ($\exists t_a, t_b \in P_j$, $t_a[A_i] \neq t_b[A_i]$) or ($A_i = A_{|QID|}$)
13.     Apply non-homogeneous generalization on $P_j$
14.   Else
15.     partition($P_j$, $k$, $A_{i+1}$) // recursive call
16. End for

Figure 8: Partitioning Algorithm

5.3.2 Cost analysis of partitioning

Before we apply the partitioning algorithm shown in Figure 8, we need to sort the attributes and the tuples. Assuming that the number of attributes in the QID is negligible compared to the number of tuples, sorting costs $O(|T| \log |T|)$. Moving data between partitions, or merging, is applied only to consecutive partitions. Each partition defined by the first attribute can recursively be repartitioned up to $d$ times, assuming a $d$-dimensional QID. As the data having the same values in attributes $A_1, A_2, \ldots, A_i$ are sorted w.r.t. attribute $A_{i+1}$, no additional sorting is required. In the worst case, where all tuples have the same values in all attributes, the table will be read $d$ times, so the worst-case cost is $O(|T|(d + \log |T|))$.

6. EXTENSION TO L-DIVERSITY

In this section, we discuss how we can apply non-homogeneous generalization for other privacy principles. In particular, we focus on $l$-diversity, described in Definition 9.

DEFINITION 9. ($l$-diversity) Let $T$ be a table with a sensitive attribute $S$, and $H$ be the projection of $T$ on key and QID attributes. $l$-diversity is preserved by an anonymized table $T'$ if $\forall t_i \in H$, $\forall s \in S$, $\Pr(t_i[S] = s) \leq 1/l$ in a linking attack by joining $H$ and $T'$.

By definition, in order to satisfy $l$-diversity, each partition $P$ should not contain a sensitive value that occurs more than $|P|/l$ times. In such a partition, we can order the tuples in $P$ such that no $l$ consecutive tuples have a sensitive value that occurs more than once. For example, consider a partition of size 4, where there are 2 sensitive values $s_1$ and $s_2$, each appearing 2 times in the partition.
In order to achieve 2-diversity, we can arrange the tuples with sensitive values in the order $\{s_1, s_2, s_1, s_2\}$. By applying ring generalization to the ordered tuples, each generalized QID will cover two tuples with different sensitive values. Hence, 2-diversity is satisfied. Thus, non-homogeneous generalization can directly be applied on the partitions generated by any existing algorithm for $l$-diversity, like [6, 26]. The utility will be higher than or equal to that of applying homogeneous generalization. However, as existing algorithms do not take into account non-homogeneous generalization, they tend to generate partitions with minimal sizes, so the utility improvement by non-homogeneous generalization may not be maximal.

We now discuss how our partitioning strategy (described in Section 5.3) can be adapted to generate $l$-diverse partitions. Note that the original table should satisfy $l$-diversity; otherwise it cannot be split into partitions which all satisfy $l$-diversity [26]. Recall that our partitioning strategy recursively divides the tuples using one attribute at a time. In each iteration, we obtain a set of partitions $P_1, P_2, \ldots, P_m$. For a partition $P_j$ that does not satisfy $l$-diversity, we find a partition $P_x$ such that $l$-diversity is satisfied by the merged partition $P_j \cup P_x$. If such a partition cannot be found, we merge $P_j$ with a random partition and merging continues until the resulting partition satisfies $l$-diversity. In the worst case, the algorithm will merge all partitions into a single one, which must satisfy $l$-diversity. Hence, this scheme always gives a set of partitions that satisfy $l$-diversity. Then, for each partition, we order the tuples and apply non-homogeneous generalization. As we have discussed in Section 5.2.1, the cost of randomization is highly correlated to the partition size. So, if a huge partition is generated by the above process, the generalization cost for it may be extremely high.
In order to reduce the cost, we perform arbitrary splits to such partitions, making sure that each sensitive value appears in each smaller partition at most once. This way, each partition has a maximum size equal to the domain of the sensitive attribute, which is usually small in practice.

7. EMPIRICAL EVALUATION

In this section, we experimentally compare our anonymization scheme, which is based on non-homogeneous generalization, with the state-of-the-art $k$-anonymity algorithm of [6]. The algorithm of [6] sorts the tuples based on their values on a Hilbert curve and then generates partitions using a dynamic programming algorithm that is optimal for homogeneous generalization of 1-dimensional QIDs. Although the resulting partitions may not be optimal, they have low information loss, since two nearby values on the Hilbert curve are also near in the original high-dimensional space with high probability. Range representation is used for generalized QIDs in [6]; however, a set representation can directly be applied on the partitions to improve utility. We use HR to denote the original algorithm of [6] with range representation; we also evaluate its version with set representation. Our algorithm applies non-homogeneous generalization on top of the partitioning scheme proposed in Section 5.3 and uses set representation for the generalized QIDs. All optimizations for randomization described in Section 5.2.1 are implemented in it. In addition, we implemented a method that uses our partitioning strategy followed by homogeneous generalization (using set representation), to verify whether any improvement in the information loss is achieved due to our partitioning strategy or due to non-homogeneous generalization. After partitioning, this method uses a single generalized QID to represent all data in each partition. Randomization is not applied by this algorithm, since generalization is homogeneous.
All algorithms are implemented in C++ and the experiments are run on an Intel Core 2 Duo 2.8GHz machine with 2GB RAM, running Windows. Our experiments are done mainly on a real, downloadable dataset, CENSUS, that is widely used in the literature (e.g., in [27, 26, 6]). The dataset contains information about 5K individuals. A summary of the attributes in the dataset is shown in Table 5. Note that the majority of attributes are nominal, indicating that a set representation for a generalized QID is more appropriate than a range representation, since there is no natural order for most of the attributes. In the experiments, we vary the following parameters: (i) number of tuples $n$: we sample the CENSUS dataset to generate input tables of varying size; (ii) number of attributes $d$ in the QID: we use the first $d$ attributes as the QID, while the other attributes are treated as others; (iii) value of $k$ in $k$-anonymity. The range of each parameter and the default value are shown in Table 6. We measure the information loss of the generalized tables using GCP (defined in Section 5.3), which is a commonly used metric
