K-Automorphism: A General Framework for Privacy Preserving Network Publication

Size: px

Start display at page:

Download "K-Automorphism: A General Framework for Privacy Preserving Network Publication"

Ralf Ellis
6 years ago
Views:

1 K-Automorphism: A General Framewor for Privacy Preserving Networ Publication Lei Zou Huazhong University of Science and Technology Wuhan, China zoulei@mail.hust.edu.cn Lei Chen Hong Kong University of Science and Technology Hong Kong leichen@cse.ust.h M. Tamer Özsu University of Waterloo Waterloo, Canada tozsu@cs.uwaterloo.ca ABSTRACT The growing popularity of social networs has generated interesting data management and data mining problems. An important concern in the release of these data for study is their privacy, since social networs usually contain personal information. Simply removing all identifiable personal information (such as names and social security number) before releasing the data is insufficient. It is easy for an attacer to identify the target by performing different structural queries. In this paper we propose -automorphism to protect against multiple structural attacs and develop an algorithm (called ) that ensures -automorphism. We also discuss an extension of to handle dynamic releases of the data. Extensive experiments show that the algorithm performs well in terms of protection it provides.. INTRODUCTION Social networ applications, such as hi (hi.com), faceboo (faceboo.com), and myspace (myspace.com), have become popular for sharing information. As a consequence, the amount of social networ data has grown rapidly, and this offers rich opportunities for data mining and analysis, for example, to find community groups and their evolution [, ]. However, social networ data usually contain users private information; it is important to protect these information in any sharing and mining activities. There are well-nown examples of unintended release of private information in released (also called published) data, causing organizations to become increasingly conservative in releasing these data sets. Realization of the promise of social networs requires addressing these concerns. Before publishing all these data for analysis, data mining, and other purposes, it is necessary to ensure that the pub- This wor was partially done when the first author was visiting University of Waterloo as a visiting scholar. The first author was partially supported by National Natural Science Foundation of China under Grant 777. The second author was supported by Hong Kong RGC GRF 8 and NSFC/RGC Joint Research Scheme N HKUST /8. The third author was supported by National Science and Engineering Research Council (NSERC) of Canada. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB 9, August -8, 9, Lyon, France Copyright 9 VLDB Endowment, ACM ----//. lished data will not contain any private information. In this paper, we focus on the identity disclosure problem [], which is one possible privacy lea concern when a social networ is published. Formally, given a published social networ G, if an adversary can locate the target entity t as a vertex v of G with a high probability, we say that the identity of t is disclosed. A naive anonymization method would remove all personal identifiable information before publishing the networ, such as names and social security numbers (SSN). However, even when a networ is published without any identity information, it is still possible to locate the target with a high probability based on some structural information around the target [7]. In fact, once it can be determined that a published vertex v of G corresponds to target t, all sensitive attributes associated with v can be recognized as t s private information. In other words, all the private information about t is released to the adversary, such as the ban account balance in a financial networ or infected diseases in a disease spread networ. Moreover, based on disclosed identities, some sensitive lin information can be easily derived as well. For example, if we can locate two targets t and t in G, it is straightforward to figure out whether there is an edge between t and t. The relationship between t and t may be sensitive and private. Therefore, identity disclosure is a critical factor to consider in publishing privacy-preserved networ data. In order to identify private information in released data, an adversary usually launches different queries using some bacground nowledge. For example, in tabular data publication, the quasiidentifier attributes (the minimal set of attributes that can be joined with external information to re-identify individual records) can be used as bacground nowledge. Similarly, any topological structure of the networ can be utilized to identify the target in the released networ. There can be four types of structural attacs in this environment [7,, ]: degree-attac, sub-graph attac, -neighbor-graph attac, and hub-fingerprint-attac. We discuss these in detail in Section ; the following example demonstrates the problem using one of these types. Example. Given a social networ G in Figure a, Figure b shows the naive anonymized networ G obtained by removing all individuals names in G. We assume that the target entity is Bob. Note that, the numbers beside vertices in Figure are not vertex labels, but vertex IDs that we introduce to simplify description. From Figure b, we can observe that if we now that Bob has four neighbors (i.e. degree attac), we can uniquely identify that Bob is vertex 7 in G. This simple example shows that a naive privacy-preserved published networ (one where the identities are removed) is still susceptible to these structural attacs. A number of recent wors study privacy-preserving networ publication [7,, ]. However, these

2 Alice Carol 9 Dave Bob Greg 7 Alice 8 David Max Jenny Fred (a) Original Networ G 7 8 (b) Naïve Anonymized Networ G 9 9 Figure : Anonymized Networs 7 8 (c) Another Anonymized Networ Ĝ suffer from the following limitations. ) All but one [7] of the proposals consider only a single type of attac [, ]. Furthermore, we do not now of any technique that can guarantee the privacy of a released networ under a subgraph attac (-neighbor sub-graph attac [] is a special case of sub-graph attac). In practice, it is not realistic to assume that an attacer would launch only one type of attac (i.e. an adversary has only one type of structural information about the target). One has to assume that there will be simultaneous multiple attacs and the techniques need to be resilient to all of them. ) The only solution that can handle multiple attacs [7] introduces considerable uncertainty into released networs. Since the released networ only contains a summary of structural information about the original networ, users have to generate some random sample instances of the released networ for further analysis. The samples come from a large number of possible worlds, which introduces much uncertainty, maing subsequent analysis difficult. ) Existing methods do not consider dynamic releases. This is important in evolutionary networs and dynamic social networ analysis [, ]. For example, given a series of online trading networs, such as ebay (ebay.com), based on community evolution in these networs, we can predict the trend of consumers purchasing behavior. These applications require republishing data at different times to support dynamic analysis. However, all existing privacypreserving networ publication methods consider only one-time release. Even though each released networ G t at time T t can guarantee privacy individually, an adversary can still identify the target with a high probability by collecting the information from multiple releases. Considering the above limitations, we propose a systematic method for privacy-preserving publishing of social networ data. The technique has three advantages. ) It can guarantee privacy under any structural attac. In our method, we do not assume one type of attac. We assume that an adversary can have complete information about the target, such as degree, neighbors, shortest-distances from hubs and so on. Our method can guarantee privacy, even though an adversary can launch multiple and different types of structural attacs. ) The released networ has no uncertainty. The released networ generated by our method can provide not only a summary of the structural information regarding the whole networ, but also structural information about each individual vertex. ) It can guarantee privacy under dynamic releases. Even though an adversary can have historical information about the target, the target cannot be identified in our released networs. The intuition of our proposed method is the following: Assume that there are - automorphic functions F a (a=,...,-) in the released networ G, and for each vertex v, F a (v) F a (v) (a a ). Thus, for each vertex v in G, there are always - other symmetric vertices. This means that there are no structural differences between v and each of its - symmetric vertices. Thus, it is not possible to distinguish v from the other - symmetric vertices using any structural information. Therefore, the target cannot Table : Meanings of Commonly Used symbols G The Original Networ G The Naive Anonymized Networ G The Anonymized Graph after Partition and Alignment. G The Anonymized Graph after algorithm G i The Anonymized Graph after GenID algorithm at time T t. U i A group of blocs. P ij A bloc in group U i be identified with a probability higher than. However, finding the automorphic functions F a is a ey problem. In this paper, we develop three techniques, namely graph partitioning, bloc alignment and edge copy. In order to address the privacy disclosure in dynamic releases, we propose vertex ID generation technique. In our method, the intersection of query results in different publications of the same networ G always has at least candidates. Thus the probability of identification is ept at. In summary, we mae the following contributions:. We propose -automorphism model for privacy preserving networ publication, which can guarantee privacy under any structural attac.. We propose a systematic method to protect the released social networ data from all structural attacs. Specifically, we propose an algorithm to convert original networ G into - automorphic networ G, which is then released.. We consider dynamic releases of networs. In order to avoid privacy disclosure in re-publication of networs, we propose vertex ID generation technique.. Extensive experiments and comparisons show the superiority of our methods. The remainder of the paper is organized as follows. Bacground nowledge and related wor are discussed in Section. -automorphism model is proposed in Section. We propose algorithm in Section. The privacy protection in dynamic releases is discussed in Section. We evaluate our method in Section. Section 7 concludes the paper.. PRELIMINARIES AND RELATED WORK We first briefly review the terminology that we will use in this paper. Note that, for ease of presentation, we use the terms graph and networ interchangeably, as well as the terms release and publish. Table I lists the symbols used in this paper. For most of the paper, we only consider non-labeled networs. A networ V, E is defined in the usual manner, where V is the set of vertices and E is the set of edges. DEFINITION.. Graph Isomorphism. Given two graphs Q = V Q, E Q and G = V G, E G, Q is isomorphic to G, if and only if there exists at least one bijective function f: V Q V G such that for any edge (u, v) E Q, there is an edge (f(u), f(v)) E G. DEFINITION.. Graph Automorphism. An automorphism of a graph G = V, E is an automorphic function f of the vertex set V, such that for any edge e = (u, v), f(e) = (f(u), f(v)) is also an edge in G, i.e., it is a graph automorphism from G to itself under function f. If there exist automorphisms in G, it means that there exists - different automorphic functions. DEFINITION.. Sub-Graph Isomorphism. Given two graphs Q and G, if there exists at least one sub-graph X in graph G such that Q is isomorphic to X under the bijective function f, graph Q is sub-graph isomorphic to graph G. We call X a sub-graph match (match for short) of Q in G. The vertex f(u) in G is called the match vertex with regard to vertex u in Q.

3 . Structural Attac DEFINITION.. Query. Given a social networ G, a query Q refers to any information that an attacer can use to extract private information from G. The result of Q is a set of vertices V V. Each v i V is called a match vertex. DEFINITION.. Structural Attac. Given a released networ G, if a query Q over G launched by an attacer has a limited number of match vertices in G, then target t might be uniquely identified. If Q is based on the structural information about t in G, this is called a structural attac. The definition of a query Q in this paper is restricted. Obviously, an attacer can also launch a query Q based on non-structural information (such as vertex label) to identify the target. In this paper, we only consider structural attacs. Four types of structural attacs have been identified in networ data publication [7,, ]. We illustrate these attacs through Figures and. ) Degree Attac. Assume that an adversary nows that Bob has four neighbors. According to the released networ in Figure b, the adversary can identify Bob uniquely, since only vertex 7 s degree is four. Vertex refinement attac [7] is a generalized type of degree attac. It assumes that an adversary can now the number of the target s n-hop neighbors, where n. Bob III II I V IV Bob III II I V IV VI Bob II I VII III (a) Q (b) Q (c) Q V Figure : Query Graphs VI Bob III II I V VII IV (d) Q ) Sub-graph Attac. Assume that an adversary nows that query Q (in Figure d) exists around Bob, and vertex I in Q corresponds to Bob (as in Figure, Roman numerals beside vertices in Figure are vertex IDs, not vertex labels). Since there is only one match of Q in the released networ, Bob can be uniquely identified. ) -Neighbor-Graph Attac. Assume that an adversary nows Bob s neighbors, and the connections among Bob and his neighbors, which is denoted by Q in Figure a. There is only one match of Q in the released networ. Thus, Bob s privacy will be compromised. As noted earlier, -neighbor-graph attac is a special case of sub-graph attac. ) Hub Fingerprint Attac. Assume that some hub vertices have been identified in the released networ. If an adversary nows the distances between Bob and these hubs, Bob may also be identified.. Information Loss and Utility In privacy preserving data publishing, it is necessary to balance usually conflicting concerns: privacy protection and information loss. Before we discuss our method in detail, we first propose measures of information loss and utility that we employ. In this wor, we use the anonymization cost to quantify information loss. DEFINITION.. Anonymization Cost. Given an original networ G and its anonymized version G, the anonymization cost in G is defined as Cost(G, G ) = (E(G) E(G )) (E(G) E(G )) where E(G) is the set of edges in G. The rationale of using anonymization cost to measure the information loss in G is that a lower anonymization cost indicates that fewer changes have been made to the original graph G. Furthermore, some techniques also use statistical networ measures to evaluate the utility of released networ [,, 7], such as degree difference, average shortest-paths and cluster coefficient. We also use these statistical measures in our experiments.. Related Wor Privacy-preserving data publication has received increasing interest in database community [7,,,, 8, 9,, 9,,,, ]. Most previous wor focus on tabular-data publication. With the increasing popularity of social networ applications, analysis of these data has also started to attract attention. How to publish social networ data for data mining and analysis without leaing any privacy information is an interesting problem that has been studied [7,,,,, 8, ]. A number of recent wors study privacy-preserving networ publication [7,, ]. As noted earlier, all of these except one [7] assume that an adversary launches only one type of attac. For example, Zhou and Pei study how to protect released networs from -neighbor-graph attac [], and degree attac is addressed by Liu and Terzi in []. However, in practice, an attacer can launch multiple attacs to identify the target and the techniques need to be resilient in the face of multiple attacs. As an example, given an original networ G in Figure a, according to existing methods [, ], we get an anonymized networ Ĝ in Figure c. In Ĝ, there are always at least vertices whose degrees are d, for any d. Furthermore, for any vertex v and its -neighbor graph NG(v), there always exists another vertex v, whose -neighbor graph NG(v ) is isomorphic to NG(v). Consequently, Ĝ can guarantee privacy under both degree attac and -neighbor sub-graph attac. However, if we now that query Q (Figure d) exists around target Bob, Bob s identity will still be compromised, since there is only one match of Q in Ĝ. Hay et al. [7] propose automorphically equivalent and discuss multiple attacs. However, the solution introduces considerable uncertainty in released networs. Given the same original networ G in Figure a, according to their graph generalization method, G is partitioned into two blocs. Each bloc is condensed into one super node, and the edges between two blocs are condensed into super edges, as shown in Figure. The number of edges in the super node X is denoted as X s weight d(x, X), and the number of edges between X and Y is denoted as the super edge (X, Y ) s weight d(x, Y ). Although this method can guarantee privacy in the released networ, there is much uncertainty in the released networ. Since users cannot now the structural information in each supernode, they need to generate samples of the super-graph for further analysis. Authors point out that the number of possible worlds for samples is: W (G) = X V ( X ( X ) d(x,x) ) X,Y V ( ) X Y d(x,y ) Obviously, the large number of possible worlds ( W (G) ) introduces much uncertainty in the released networ, which complicates the use of the released data. It also leads to large information loss, such as structural information about each vertex in original networs. Besides identity disclosure that we focus on, protecting sensitive lins among the individuals in the anonymized social networ has also received attention. Ying and Wu in [] study how anonymization algorithms that are based on randomly adding and removing edges change the spectrum of the networ. The lin-disclosure problem they define is orthogonal to identity disclosure problem discussed in this paper. Frien and Golle [] study how to protect privacy, if a group of authorities can jointly reconstruct a entire

4 G d( X, X ) = d( X, X ) = d( X, Y ) = Figure : Graph Generation Method graph G by assembling their own pieces of G. The problem is different from ours, since we only consider centralized data publication.. K-AUTOMORPHISM In order to guarantee privacy from any structural attac, we propose the following concept. DEFINITION.. K-automorphic Networ. Given a networ G, (a) if there exist - automorphic functions F a (a=,...,-) in G, and (b) for each vertex v in G, F a (v) F a (v) ( a a ), then G is called a -automorphic networ. Obviously, if G is a -automorphic networ, for any vertex v in G, we cannot distinguish v from its symmetric vertices based on any structural information. Thus, an adversary cannot identify v from G with a probability higher than. Therefore, the problem that we want to solve in this paper is defined as follows: Given an original networ G, find a networ G, where G is a sub-graph of G and G is a -automorphic networ. G is published as G s anonymized version. We require that G is a sub-graph of G. Thus, all lin information in the original networ G should be preserved in its anonymized version G. Furthermore, in order to minimize anonymization cost (Definition.), and to ensure that the anonymized networ is as similar to the original as possible, we require that E(G ) E(G) is minimized, where E(G) is the set of edges in graph G. In this paper, we propose a systematic method to convert a social networ G into a -automorphic networ G. Although our method addresses all structural attacs, for clarity, we only use subgraph attac as an example to show how we protect the privacy in released networ data. We prove in Theorems. and. that the released networ G produced by our algorithm is a -automorphic networ, which can resist any structural attac. Given a sub-graph query Q, if there are ( > ) matches of Q, no adversary can identify the target uniquely. Recall Example, where a naive anonymized networ G was generated by removing all names from the original networ G (Figure b). Given a sub-graph Q in Figure b, there are two matches in G. Thus, using query Q, an attacer cannot identify Bob uniquely. However, -match is not sufficient to avoid identity disclosure. If these matches have some shared vertices, the privacy associated with the shared vertices may still be compromised. For example, an adversary may now that query Q in Figure c exists around Bob. There are eight matches of Q in the social networ G : (7,, 9, 8,, ), (7,, 9, 8,, ), (7,, 8, 9,, ), (7,, 8, 9,, ), (7,,, 8,, ), (7,,, 8,, ), (7,, 8,,, ) and (7,, 8,,, ). However, these eight matches have the shared match vertices (i.e. vertices,7 in G ) with regard to vertex I and II in query Q. As shown in Figure c, vertex I corresponds to target Bob in query Q. In this case, although Q has eight matches, an adversary can still identify Bob in G uniquely. To overcome the limits of -match, we define the concept of different matches as follows: DEFINITION.. Different Matches. Given a sub-graph query Q and two matches m and m of Q in a social networ G, where m and m are isomorphic to Q under functions f and f, respectively, if there exists no vertex v (in query Q) whose match vertices in m and m are identical, (i.e. f (v) = f (v)), we say that m and m are different matches. Consider two matches m = (7,, 9, 8,, ) and m = (7,, 9, 8,, ) of query Q in Figure c. For vertex I in query Q, the match vertices of vertex I in m and m are both vertex 7 in G. Thus, m and m are NOT different. On the other hand, for two matches m = (7,, 9,, 8) and m = (,,,, ) of query Q, there exists no vertex v (in query Q ) whose match vertices in m and m are identical. Thus, these two matches are different. Therefore, in order to avoid identifying the target from the released networ G, we propose -different match principle. DEFINITION.. -different match principle. Given a released networ G and any sub-graph query Q, if (a) there exist at least matches of Q in G, and (b) any two of the matches are different matches according to Definition., then G is said to obey -different match principle.. K-MATCH () ALGORITHM Obviously, if a released networ G satisfies -different match principle, given any sub-graph query Q, no adversary can identify the target with a probability higher than. In this section, we propose our algorithm to find an anonymized networ G that satisfies -different match principle. We also prove that G obtained by our algorithm is a -automorphic networ and G is a sub-graph of G. Thus, G can guarantee privacy under any structural attac.. Overview We illustrate the main ideas of algorithm using Figure. Assume that we need to guarantee that the released networ G satisfies -different match (i.e. =). First, we partition the original networ G into blocs, P and P, as shown in Figure b. Then, we perform graph alignment on P and P to obtain two alignment blocs P and P (adding edge (, )), where P is isomorphic to P. The details of graph alignment are discussed shortly. Obviously, if a match of query Q is contained in alignment bloc P (or P ), it must also be contained in the other alignment bloc P (or P ). This means that, if a match of query Q can be contained in some bloc of the anonymized networ G, there must exist at least matches of Q in G. 7 8 Bloc Alignment 9 P (a) Naïve anonymization Networ G P (b) Graph Partition and Alignment G Edge Copy P Figure : Graph Partition and Edge Copy P (c) Edge Copy * G However, the anonymized networ after graph partition and alignment is still insufficient to satisfy the -different match principle. For example, given query Q in Figure d, there is still only one match of Q in the anonymized networ in Figure b. This is because the match crosses two blocs. In order to handle the problem, we propose the edge-copy technique (Section.). Since bloc P is isomorphic to P under function f, for each vertex v in P, there exists a corresponding vertex f(v) in P.

5 There is a crossing edge between vertex (in bloc P ) and vertex (in bloc P ) in Figure b; thus, we need to introduce another edge between vertex f() (that is vertex 9 in bloc P ) and f () (that is vertex in bloc P ). Figure c shows the anonymized networ after edge-copy. Given the same query Q, there are two matches m = (7,, 9,, 8,, ) and m = (,,,,, 9, ). Although these two matches have some shared vertices, they correspond to different vertices in query Q. Thus, according to Definition., these two matches are different. The K-Match algorithm ( for short) in Algorithm implements the approach we propose. We start by removing all identity information from the original networ G resulting in the naive anonymized networ G (line in Algorithm ). We then partition G into n blocs and cluster these blocs into m groups U i, (i =,..., m), where each group U i has at least blocs (line ). For each group, we perform graph alignment (we also use the term bloc alignment interchangeably) to obtain alignment blocs (lines -). We replace original blocs by alignment blocs to obtain networ G (line ). Then, we perform edge-copy to get G (line ), which is reported as the anonymized networ (line 7). The details of each step will be discussed shortly. Obviously, different partitionings of the original networ G and different bloc clusterings lead to different released networs G and different anonymization costs. We delay the discussion of finding a good partitioning and bloc clustering to minimize anonymization cost until Section.. According to Algorithm, each group U i has i blocs, where i. For ease of presentation and without loss of generality, we assume in the remainder that each group has exactly blocs. Algorithm K-Match () Algorithm Require: Input: An original graph G and the parameter. Output: The anonymized networ G, which is a -auto-morphism networ. : Generate G by removing all identity information from G. : Partition G into n blocs and cluster the blocs into m groups U i, (i =,..., m), where each U i has at least blocs P ij, (j =,..., i and i ). (see Algorithm in Section.) : for each group U i do : perform graph alignment on all blocs P ij in U i to obtain alignment bloc P ij. (see Algorithm in Section.) : Replace each bloc P ij by its alignment bloc P ij to obtain anonymized networ G. : For all crossing edges, perform edge-copy to obtain anonymized networ G. (see Algorithm in Section.) 7: Output G.. Bloc (Graph) Alignment For each group U i that is obtained following partitioning and clustering, we perform graph alignment operations on all blocs P ij in U i. Specifically, each bloc P ij is transformed into alignment bloc P ij, where all P ij are isomorphic to each other. Furthermore, P ij is a sub-graph of P ij. DEFINITION.. Alignment Vertex Instance. Given a group U i with blocs P ij, j =,...,, assume that alignment blocs P ij = (V ij, E ij) are the blocs obtained after graph alignment, namely, j P ij is a sub-graph of P ij and all P ij are isomorphic to each other. Due to graph isomorphism, given an alignment bloc P ij, for each vertex v in P ij, there must exist symmetric vertices in other blocs respectively. The set containing v and v s symmetric vertices form alignment vertex instance I, where I =. All alignment vertex instances are collected to form alignment vertex table (AVT). Example. We illustrate Definition. using Figure. The naive anonymized networ G in Figure a is partitioned into blocs P and P, which are clustered into one group U. Figure shows a graph alignment between two blocs P and P (note the addition of edge (, ) in P in order to get graph isomorphism between P and P ). Since P is isomorphic to P, for each vertex v in P (or P ), there exists a symmetrical vertex u in P (or P ). Therefore, each (v, u) is an alignment vertex instance I. Note that in this example, = ; thus, each vertex v has one symmetric vertex in the other bloc. Figure shows AVT, which contains all alignment vertex instances. P P P P Figure : Bloc Alignment DEFINITION.. Alignment Cost. Given a group U i with blocs P ij, j =,...,, assume that P ij are the blocs obtained after bloc alignment. The cost of bloc alignment in group U i is defined as follows: AlCost(U i) = j= Min(EditDist(Pij, P ij)) where EditDist(P ij, P ij) is defined as the number of graph edit operations (insert vertex/edge, delete vertex/edge) required to transform P ij into P ij. Since P ij is a sub-graph of P ij, EditDist(P ij, P ij ) = E(P ij ) E(P ij ), where E(P ij ) is a set of edges in P ij. The optimal bloc alignment in one group U i with blocs means that AlCost(U i) is minimal. THEOREM.. Finding the optimal bloc alignment for one group U i with blocs is NP-hard. PROOF. (Setch) Minimal graph edit distance problem can be reduced to finding the optimal alignment. The former is a classical NP-hard problem []. Due to the hardness of bloc alignment, we propose an heuristic algorithm (Algorithm ) to find a good alignment (not necessarily optimal) for blocs in one group U i. At the beginning of Algorithm, we call Algorithm to build the AVT, which contains all alignment vertex instances. According to AVT, for each vertex v, we can find symmetric vertices in the other blocs. If there is an edge between v and v in a certain bloc P ij, we need to introduce an edge between v s symmetrical vertices and v s symmetrical vertices in the other alignment blocs in the same group U i. Thus, the ey problem is how to build AVT. Given a group U i with blocs P ij, j =,...,, Algorithm implements our approach. Initially, we find vertices v ij from blocs P ij respectively, which have the same vertex degrees d. If there are multiple choices for d, we choose d with the largest value. If there is no vertex in one bloc P ij with the same degree as the selected vertices in the other blocs, we choose a vertex with the largest degree in P ij. The set of these vertices form the initial alignment vertex instance. We next perform breath-first search starting from v ij in each bloc P ij in parallel. During this process, we always pair up vertices (in blocs P ij respectively) with similar degrees. For

6 some vertex v ij in P ij, if we cannot find the corresponding vertex in the other blocs, we will introduce a dummy vertex, which has the same label as the corresponding vertex. Finally, we get an AVT containing all alignment vertex instances. Algorithm GraphAlignment: Bloc Alignment for Blocs Require: Input: A group U i with Blocs P ij, where j =,..., Output: Alignment blocs P ij, where j =,...,. : Call constructavt algorithm (see Algorithm ) to obtain AV T = constructav T (U i ). : Initialize blocs P ij, where j =,...,. : for j =,..., do : for each edge (v, v ) in P ij do : Insert edge (v, v ) into bloc P ij, and edges between v s symmetrical vertices and v s symmetrical vertices in other alignment blocs respectively. : Report all alignment blocs P ij, j =,...,. Algorithm constructavt(u i ), where j =,..., : Built AVT for a group with blocs P ij, where j =,..., : Set all vertices in each P ij as un-visited, initialize AVT : Find v ij in each bloc P ij, where all degree(v ij ) = d. If there are multiple choices for d, choose d with the largest value. If there are no choices for d, choose v ij with the largest degree from bloc P ij respectively : The set of all v ij form the initial alignment vertex I instance in AVT. : Perform breath-first search (BFS) starting from v ij in each P ij in parallel. : During BFS, vertices from blocs with similar vertex degrees are collected to form an alignment vertex instance in AVT. : Report AV T. The number of the iterations between lines - in Algorithm depends on the total number edges in blocs P ij, j =,..., i. Thus, time complexity of Algorithm is O( j= i j= E(Pij) ). Since BFS search time complexity in bloc P ij is O( E(P ij) + V (P ij) ), the time complexity of Algorithm is O(Max( E(P ij) + V (P ij) )). After graph alignment, we replace each bloc P ij in naive anonymized networ G (line in Algorithm ) by its alignment bloc P ij. In this way, we can obtain an anonymized networ G. Obviously, given a sub-graph query Q, if a match of Q is contained in a certain bloc P ij, Q is also contained in the other alignment blocs in the same group U i. This means that Q has at least different matches in G. As discussed in Section., the anonymized networ G after graph partition and bloc alignment still compromises some privacy. For example, if all matches of query Q crosses at least two blocs, we cannot guarantee that G satisfies the -different match principle. Consider G in Figure b that is obtained after bloc alignment. Query Q in Figure d has only one match in G. Thus, G does not satisfy -different match principle. This can be addressed by the edge-copy technique that is described in the next section.. Edge Copy According to line in Algorithm, there are m groups U i, i =,..., m. For each group U i, after bloc alignment, we have alignment blocs P ij, (j =,..., ). Furthermore, according to Algorithm, each group U i has an AVT A i. Union of A i, i =,..., m, are collected to form the overall AVT for the whole graph. The next step is to process the crossing edges between two blocs in G. As discussed in Section, the intuition behind our wor is to find - automorphic functions F a. According to these functions, we can convert the original graph G into a -automorphic graph G, which guarantees privacy under any structural attac. We define function F a (a =,..., ) based on AVT. We treat alignment vertex instance I in AVT as a circularly-lined list. Specifically, we use I.at(a) to denote the a-th vertex in instance I. We define v.next=i.at(a + ), if a I ; else v.next = I.at(), where I is the number of vertices in I. DEFINITION.. For each vertex v, we assume that v is in the instance I in AVT. v s automorphic function F a (a =,..., ) is defined as follows: ) F (v)=v.next; ) F a(v)=f (a ) (v).next, where a >. We illustrate F a using Figure. Since = in Figure, each instance has two vertices, and we have one automorphic function. Thus, F () = 9 and F (9) =. Similarly, for vertex, F () = 7 and F (7) =. In order to guarantee -automorphism under functions F a, we need to copy crossing edges which are defined as follows. DEFINITION.. Boundary Vertex and Crossing Edge. Given a vertex v in a bloc P, v is a boundary vertex if and only if v has at least one neighbor vertex that is outside of bloc P. An edge e = (v, u) is called a crossing edge if and only if v and u are boundary vertices in two different blocs. Assume that there is a crossing edge e = (v, u ). According to alignment functions, we introduce edges between pairs (F a(v ), F a(u )) (a =,..., ). For example, there is a crossing edge (, ) in G in Figure b. Since =, we introduce another edge between (F (), F ()) (i.e. (9, )). This results in the networ G (Figure c), which satisfies -different match principle. Note that, if there exists an edge between F a(v ) and F a(u ) in the original networ G, we will not introduce an extra edge (F a(v ), F a(u )). Algorithm Edge Copy Algorithm Require: Input: The original networ: G; The networ after graph partition and bloc alignment: G ; Alignment Vertex Table: AVT Output: The anonymized networ G. : Duplicate G into G and remove all crossing edges in G. : for each crossing edge (v, u) in the original networ G do : Add edge (v, u) and (F a(v), F a(u)) (a =,..., ) into G. : Report G as release networ. There are O( E(G) ) edges to be copied. Thus, the number of iterations of lines - in Algorithm is O( E(G) ). Therefore, the time complexity of Algorithm is O( E(G) ).. Graph Partitioning Algorithm relies on the partitioning of the naive anonymized networ G into n blocs that are clustered into m groups U i, (i =,..., m). As noted earlier, different partitionings and bloc clusterings will lead to different anonymization cost Cost(G, G )(defined in Definition.). In this section, we discuss how to find a good graph partitioning and bloc clustering to minimize Cost(G, G ). Since G is a sub-graph of G, Cost(G, G ) = E(G ) E(G), namely, Cost(G, G ) indicates the number of edges introduced by the algorithm. According to, we introduce edges in both bloc alignment (Algorithm ) and edge copy (Algorithm ). The effect of graph partitioning is quite difficult to quantify. For example, if we partition G into fewer blocs, there are fewer crossing edges to be copied. On the other hand, fewer blocs imply that the size of each bloc is large. Thus, we need to introduce more edges to perform bloc alignment. The optimal graph partitioning

7 of G implies that algorithm can produce a released networ G where E(G ) E(G) is minimal. Finding the optimal graph partitioning is NP-complete, since we can reduce a NP-complete problem (graph partition with min-cut) to this problem. Thus, we instead propose a heuristic partitioning algorithm. Theorem. shows the relationship between Cost(G, G ) and graph partitioning and bloc clustering. DEFINITION.. Given a group U i with blocs P ij, j =,...,, anonymization cost of group U i is defined as follows: Cost(U i ) = AlCost(U i )+. ( ) j= CrossEdge(P ij) where AlCost(U)is defined in Definition. and CrossEdge(P ij ) is the number of crossing edges associated with bloc P ij. In Definition., AlCost(U i) denotes the number of edges introduced during bloc alignment. For each crossing edge e = (v, u), we need to introduce copy-edges (F a(v), F a(u)). Each crossing edge is associated with two blocs. Thus,. ( ) j= CrossEdge(P ij) denotes the number of edges introduced into group U i during the edge copy process. THEOREM.. Assume that a networ G is partitioned into n blocs that are clustered into m groups U i, where each group U i has blocs. Let G be an anonymized networ produced by algorithm. Then Cost(G, G ) = m i= Cost(U i) where Cost(G, G ) and Cost(U i) are defined in Definitions. and., respectively. PROOF. (Setch) The number of edges that algorithm introduces equals to the sum of introduced edges in each group U i, i =,..., m. According to Theorem., we propose a greedy method to find a good graph partitioning and bloc clustering. In each step, we try to find the group U i with the minimal Cost(U i). Before introducing our method, we cite the following definition. DEFINITION.. Frequent Sub-graph []. Given a large graph G and a minimal support min sup, a graph g f is called a frequent sub-graph of G, if and only if g f has at least min sup matches in G, where any two matches are edge-disjoint. Algorithm Graph Partitioning and Bloc Clustering Require: Input: The naive anonymized G and. Output: a set of groups S = {U i }, (i =,..., m), where each group U i has blocs P ij, (j =,..., ). : repeat : Find frequent sub-graphs {g f } in G by setting minimal support min sup =. Find the frequent sub-graph g f with the largest number of edges. Each match of g f is extracted from G as one bloc P ij. : The set of all blocs P ij from one group U i. : repeat : set U i = U i. : for each bloc P ij in U i do 7: Expand bloc P ij by one hop. 8: The set of expanded blocs form group U i. 9: until Cost(U i ) < Cost(U i ) : G = G U i, and insert U i = {P ij } into answer set S. : until E(G) = : Report S = {U i }, i =,..., m. Algorithm implements our graph partitioning and bloc clustering. First, we find all frequent sub-graphs R = {g f } in naive anonymized networ G by setting min sup =. Then, we choose the frequent sub-graph g f with the largest number of edges in R. The rationale behind the choice is that we want to partition G into as few blocs as possible in order to minimize the number of crossing edges. All of g f s matches are extracted from G as the initial blocs (line ). These blocs are clustered to one group U i (lines,). The blocs in group U i are denoted as P ij. It is straightforward to see AlCost(U i) =, since these blocs are isomorphic to each other. If a vertex v is contained in more than one bloc (v is randomly arranged into one of them), we need to introduce dummy vertices for alignment, in which case AlCost(U i). However, AlCost(U i) is still small, since these blocs have high structural similarity. According to Definition., we can get Cost(U i) at this moment. Then, we expand all P ij in parallel. Let neighbor P ij denote all vertices that are one hop from boundary vertices in P ij. All vertices in Neighbor(P ij) are inserted into P ij (line 7). If a vertex v is contained in more than one Neighbor(P ij), v is randomly inserted into one of them. We align the new inserted vertices from each bloc P ij by degree similarity. We may need to introduce dummy vertices in the process. After expansion, AlCost(U i ) increases, since we need to introduce edges to align the new merged vertices. If all updated blocs P ij have fewer crossing edges than the original blocs, the number of introduced edges during edge copy will decrease. Bloc expansion continues if the overall group cost Cost(U i) decreases. Otherwise, it stops (Line 9). We remove all blocs P ij in U i from the naive anonymized networ G (line ), i.e, G = G U i. Note that, we choose edge-based partitioning. Specifically, all crossing edges of bloc P ij are chosen as separator to separate P ij from other parts of G. We then iterate the above process until E(G ) = (line ). Note that G is a non-labeled graph; thus, we always find frequent sub-graphs if E(G ). In the last iteration, the frequent subgraph may be an edge or an isolated vertex. Finally, G is partitioned into n blocs which are clustered into m groups U i. Each group U i has at least blocs.. Analysis and Discussion Although we illustrate our methods only using sub-graph attacs, we prove that networ G released by the algorithm is a -automorphic networ, which can guarantee privacy under any structural attac as defined in Definition.. Lemma. is used in the main Theorem. LEMMA.. The functions F a (a =,..., ) defined in Definition. are automorphic functions in the released graph G generated by algorithm. Furthermore, for each vertex v in G, F a (v) F a (v), where a a. PROOF. (setch) For each vertex v and each function F a (a =,..., ), there is a mapping vertex F a(v) (see Definition.). For two adjacent vertices v and v in G, there is an edge between v and v. If v and v are in the same bloc, according to AVT, we now that F a(v ) and F a(v ) are also in the same bloc. Due to bloc alignment, the two blocs are isomorphic to each other. Therefore, there is also an edge between F a(v ) and F a(v ). If v and v are in different blocs, then edge (v, v ) is a crossing edge. Due to edge-copy, there is also an edge between F a(v ) and F a(v ). Therefore, for any edge (v, v ), there is always an edge F a(v ) and F a(v ) under F a, a =,...,. According to Definition., F a is an automorphic function. Furthermore, according to Definition., for each vertex v, if a a, F a (v) F a (v). THEOREM.. Given an original networ G, the anonymized networ G produced by algorithm is a -automorphic networ. Furthermore, G is a sub-graph G.

8 PROOF. (setch) According to Lemma. and Definition., we now G is a -automorphic networ. Furthermore, according to algorithm, all edges in G are preserved in G. Thus, G is a sub-graph of G. According to -automorphism, we now that the released networ G can guarantee privacy under any structural attac. THEOREM.. For any structural attac, an adversary cannot identify the target with a probability higher than in any networ G that is released by algorithm. PROOF. According to Lemma., we can find automorphism in G. Thus, for each vertex v in G, there always exist - symmetric vertices. We cannot distinguish v from its other - symmetric vertices based on any structural information. Therefore, for any structural query, the released networ G produced by algorithm always contains at least symmetric vertices, or contains null. Thus, an adversary cannot identify the target with a probability higher than. There is a well-nown trade-off between privacy and accuracy for all anonymization techniques. As the networ is modified to achieve anonymization, it diverges from its original form due to the addition of vertices and edges. Therefore, there is a divergence of the results of an analysis on G from that on G. This inaccuracy occurs at varying degrees. In this wor, we want to preserve all edges in the original networ and guarantee privacy under any structural attac. To achieve this, algorithm introduces new edges. Actually, according to algorithm, the number of edges that we introduce to achieve -automorphism is bounded to ( ) V (E), where V (E) is the number of edges in the original graph G. There exist many non-trivial automorphism partitions of vertices in real networs []. Automorphism partition is similar to the concept of AVI (Definition.) in our paper. An AVI is an automorphism partition with no less than vertices. Essentially, it is only necessary to insert noisy edges to handle automorphism partition with fewer than vertices. The high symmetry property [,, 7] of real networs guarantees that algorithm introduces fewer noisy edges to satisfy -automorphism model and the worst case bound is unliely to be reached. In theory aspect, the utility of algorithm depends on how symmetrical the original graphs are. If there are many automorphism partitions [] with no fewer than vertices in original graph, we will introduce few noisy edges. The utility of will be good in this case; otherwise, the utility will degrade. Many real networs are nown to have high symmetry property [,, 7].. DYNAMIC RELEASES So far, we only consider one-time release of networ data. However, this is not satisfactory for evolutionary networs and dynamic social networ analysis [, ]. In these cases, it is necessary to re-publish networ data periodically. This poses additional challenges. Specifically, even though each publication may satisfy -automorphism, an adversary can still identify the target with a high probability when a sequence of releases of the networ are analyzed together. Therefore, to support dynamic networ analysis, it is necessary to be able to find connection between multiple releases of the networ while ensuring privacy (specifically, the enforcement of -automorphism) in these releases. These goals are usually conflicting and introduce complications. Example As shown in Figure, given a social networ G, we need to publish G at times T and T. According to algorithm, we get two released networs at times T and T, denoted as G and G, respectively. Although both G and G individually satisfy -automorphism, an adversary can still identify target Bob uniquely. Assume that an adversary nows that sub-graph Q * * Figure : Dynamic Release (see Figure d) exists around target Bob at both time T and T. According to sub-graph attac Q at time T, an adversary nows that there are two candidates vertices {, 7} for Bob in G. Similarly, at time T, there are still two candidates {, 7} for Bob. Since Bob exists at both T and T, it is straightforward to determine that vertex 7 corresponds to Bob. To remedy the above problem, a naive method is to remove all vertex IDs, or permute vertex IDs randomly (namely, a given vertex ID does not correspond to the same entity in different publications). However, if all vertex IDs are removed or vertex IDs are randomly permuted, we cannot locate the same individuals in different publications of this networ. It becomes impossible to conduct proper data analysis. In this section, we propose vertex ID generalization technique to reduce the ris of identifying the target in dynamic releases, while preserving vertex IDs as much as possible.. Vertex ID For ease of presentation, in this subsection, we assume that there are no vertex insertions or deletions in different releases of the networ. This means that the set of all vertex IDs remains unchanged in different publications of the same networ. Recall that we define functions F a (a =,..., ) in Definition. based on AVT, which are automorphic functions in G (see Lemma.). Given a structural query Q, if a vertex v satisfies query Q, vertices F a(v) (defined in Definition.), a =,...,, also satisfy Q. Thus, we can get at least results for any structural query Q. DEFINITION.. For a vertex v in a released networ G t at time T t, the set of v s corresponding vertices F a(v), a =,...,, and v are called v s associated query result set at T t, denoted as Res(v, G t ). It is straightforward to show Res(v, G t ) =, where G t is generated by the algorithm at time T t. LEMMA.. Given a series of publications Ω = {G t } of a networ G, (t =,..., s), a vertex v cannot be identified with a probability higher than, if the following holds: Res(v, G ) Res(v, G )... Res(v, G s ) = We illustrate Lemma. by Example in Figure 7, where =. For vertex, according to AVT A and A, F () = {9} in both G and G. Thus, Res(, G ) = Res(, G )= {, 9}. An adversary cannot distinguish these two vertices and 9. However, for vertex 7 in the networ, F (7) = {} in G and F (7) = {}

9 GIDT * G vertex v in G t, we replace the vertex ID v.oriid by its generalized ID v.genid to get networ G t (lines 8-9). We publish G t as a released networ at time T t (line ). After GenID algorithm, each networ G t is transformed into a -automorphic networ G t with generalized vertex IDs. The following theorem proves that G t will not lea any privacy under dynamic releases. Notice that, at time T, G is the same as G. We do not distinguish G and G in the following discussion. Algorithm Generalize Vertex ID For Released Networ G t * G Figure 7: Vertex ID in G. Therefore, Res(7, G )={, 7} and Res(7, G ) ={, 7}. If an adversary nows that the target exists at both times T and T, and launches sub-graph attac Q (in Figure d) to G and G, it is easy to locate the target as vertex 7. The following lemma gives the sufficient condition for a vertex v to be unidentifiable from a series of publications of the same networ. LEMMA.. Given a series of publications Ω = {G t } of a networ G, (t =,..., s), a vertex v cannot be identified with a probability higher than, if the following holds: Res(v, G ) Res(v, G )... Res(v, G s ) = Res(v, G ) where Res(v, G ) =. Lemma. shows the intuition of the proposed vertex ID generalization method in dynamic releases. First, we illustrate our method through Example. Assume that we have obtained - automorphic networs G and G produced by algorithm (Figure ) along with AVT A and A (Figure 7). Based on A, we can define automorphic functions in G. Similarly, we can define other automorphic functions in G based on A. Then, we generate an anonymized networ G, where each vertex in G is assigned a generalized vertex ID (see generalized vertex ID table GIDT in Figure 7). For each record r in GIDT, r.oriid is its original vertex ID, and r.genid is its generalized vertex ID. Initially, each generalized vertex ID has only one element, that is r.genid = {r.oriid}. In order to protect released networs in dynamic releases, we perform the following vertex ID generalization process. We now Res(7, G ) = {, 7} and Res(7, G ) = {, 7}. In order to avoid the disclosure of vertex 7, according to Lemma., we need to guarantee that Res(7, G ) Res(7, G ) = Res(7, G ) = {, 7}. Thus, in the released networ G, we need to insert into s generalized vertex ID, that is.genid = {, }. Figure 7 shows all generalized vertex IDs at time T. To release anonymized networ at time T, for each vertex v, we attach v.genid as its vertex ID. The GenID algorithm in Algorithm implements our approach, which only depends on the two AVTs A and A t, and the anonymized networs G and G t (obtained by algorithm at times T and T t). First, we initialize the generalized vertex ID table GIDT (line in in Algorithm ). We define automorphic functions Fa in G, a =,...,, based on A (line ). Similarly, we define automorphic functions Fa t in G t, a =,...,, based on A t (line ). Then, for each vertex v in G t, if Fa (v) Fa(v), t a =,...,, we insert vertex ID Fa (v) into Fa(v) s t generalized vertex ID, that is Fa(v).GenID t (lines -7). Next, for each Require: Input: AVT A for the networ G, and AVT At for the networ G t. Output: The anonymized networ after vertex ID generalization: G t :. Initialize table GIDT. : Based on A, define automorphic functions Fa in G, a =,...,. : Based on A i, define automorphic functions Fa t in G, t =,...,. : for each vertex v in G t : do for a =,..., do : if Fa (v) Fa(v) t then 7: Insert Fa (v) into Fa(v).GenID. t 8: for each vertex v in G t do 9: Replace v.oriid by its generalized vertex ID v.genid. : Report G t. THEOREM.. Given a series of publications Ω = {G t } (t =,..., s), where G t is generated by GenID algorithm at time T t, it is not possible to identify v in Ω with a probability higher than. PROOF. (setch) Consider any publication G t at time T t, t. According to GenID algorithm, for each v in G t, we now that Res(v, G t ) Res(v, G ). Thus, Res(v, G t ) Res(v, G ) = Res(v, G ). Therefore, Res(v, G ) Res(v, G )... Res(v, G s) = Res(v, G ). According to Lemma., Theorem. holds. In Section., we stated that we use anonymization cost to measure the information loss. In fact, the anonymization cost can nicely estimate the structural information loss since it counts the number of edges that we change before and after anonymization. However, the anonymization cost is not suitable for measuring the information loss of vertex ID generation, since this generalization relates to the content of a vertex ID. Therefore, we propose a new measure, called average generalized vertex ID size, to quantify the information loss caused by vertex ID generalization. DEFINITION.. Given a released networ G t produced by GenID algorithm, average generalized vertex ID size, denoted by AvgIDSize(G t ), is defined as follows: v V (G t ) v.genid AvgIDSize(G t ) = V (G t ) where V (G t ) is the set of vertices in G t. The above definition indicates that the larger the AvgIDSize (G t ), the higher is the information loss. We prove that the information loss caused by GenID algorithm is bounded. THEOREM.. For any publication G t at T t, where G t is produced by GenID algorithm, the following holds AvgIDSize(G t ) PROOF. Proof is based on Algorithm. Due to space limitation, we omit the details. The above theorem implies that the information loss in dynamic releases can be controlled by.

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,