(t,k)-hypergraph anonymization: an approach for secure data publishing


SECURITY AND COMMUNICATION NETWORKS
Security Comm. Networks 2015; 8:
Published online 25 September 2014 in Wiley Online Library (wileyonlinelibrary.com)

RESEARCH ARTICLE

Atefeh Asayesh*, Mohammad Ali Hadavi and Rasool Jalili
Data and Network Security Laboratory (DNSL), Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

ABSTRACT

Privacy preservation is an important issue in data publishing. Existing approaches to privacy-preserving data publishing rely on tabular anonymization techniques such as k-anonymity, which do not provide appropriate results for aggregate queries. Solutions based on graph anonymization have also been proposed for relational data, but they hide only bipartite relations. In this paper, we propose an approach for anonymizing multirelation constraints (ternary or more) with (t,k)-hypergraph anonymization in data publishing. To this end, we model constraints as undirected hypergraphs and formally cluster attribute relations, represented as hyperedges, with the t-means-clustering algorithm. Anonymization is then carried out with a k-anonymity method in every cluster, where the parameter k can vary from cluster to cluster to attain more flexibility and less information loss with respect to utility. Our experiments demonstrate that this approach offers a good trade-off between privacy and utility. Copyright 2014 John Wiley & Sons, Ltd.

KEYWORDS
data publishing; privacy; anonymization; hypergraph; clustering

*Correspondence
Atefeh Asayesh, Data and Network Security Laboratory (DNSL), Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. asayesh@ce.sharif.edu

1. INTRODUCTION

Numerous organizations publish microdata for a variety of purposes, such as demographic and public health research.
Although attributes that clearly identify individuals, such as Name and ID Number, are usually removed, such databases can sometimes be joined with other public databases on attributes such as postcode, gender, and date of birth to reidentify individuals who were supposed to remain anonymous [1]. Linking attacks are made easier by the availability of other databases over the Internet. According to one study, approximately 87% of the population of the United States can be uniquely identified on the basis of their postcode, gender, and date of birth [2,3]. Such attributes are called quasi-identifiers.

A large amount of data from various sources is required to produce useful statistical results. However, data analysis tools and methods may be maliciously used to disclose private and sensitive information. Therefore, privacy preservation becomes an essential issue in data publishing. Encryption changes data such that ad hoc queries cannot be answered correctly on the released/published database [4–6]. So, encryption-based methods such as in [7] lose generality for the purpose of data publishing. Database fragmentation, as another approach, executes queries over more than one database fragment and obtains the final answer by combining the received results [8,9]. However, statistical analysis on relations involving more than two attributes is then impossible and may lead to wrong ad hoc results. For example, given a constraint on the three attributes postcode, gender, and date of birth, preserving privacy requires placing each attribute in a separate fragment, which misdirects any statistical analysis that relies on these attributes together. Perturbation [10,11] and k-anonymity [2,7,12] models are two major techniques for such a goal. The k-anonymity model has been extensively studied because of its relative conceptual simplicity and effectiveness [10,13–15].
In this approach, data are anonymized before being released in order to prevent potential reidentification attacks. Although privacy in data publishing aims at concealing the association between attribute values [16,17], most of the existing anonymization approaches consider only bipartite attribute associations [18,19]. Nevertheless, in some real databases, privacy constraints may involve more than two attributes. In [13], Cormode et al. use (k,l)-grouping to anonymize associations between two entities. Their approach works only for binary constraints because it uses a bipartite graph, in which there are two groups of nodes.

Extending their method to multirelation constraints fails to preserve utility and privacy simultaneously, and it can lead to wrong answers. This is illustrated by Example 1.

Example 1. Assume a relation with four attributes, R = {product, date of birth (DoB), address, illness}, as shown in Table I, in which illness may have sensitive values. The constraint is defined as (1).

$$ \{\text{product}, \text{DoB}, \text{address}\} \rightarrow \text{illness} \tag{1} $$

In this paper, constraint (1) means that the ternary association between product, DoB, and address can lead to the deduction of illness. To extend the method of Cormode et al. [13] to ternary associations, we should consider two bipartite graphs, (product, DoB) and (DoB, address), and anonymize them separately. Furthermore, the maximum of the lower bounds and the minimum of the upper bounds of the response sets should be computed for aggregate queries. We consider two graphs because a bipartite graph has only two parts, whereas our example has three attributes (product, DoB, and address). Therefore, we divide them into two bipartite graphs. This approach leads to a lack of data in some situations.

Consider the following scenario using Table I. For secure data publishing based on bipartite graphs, we fragment the relation into two relations with attributes (product, DoB) and (DoB, address). Next, we model each fragment as a bipartite graph and then anonymize the fragments by the anonymization technique of Cormode et al. [13]. It should be noted that we cannot consider a single bipartite graph of ({product, DoB, address}, illness): we want to answer aggregate queries related to two attributes among {product, DoB, address}, and if we model {product, DoB, address} as one node, queries such as "the number of people who buy a2 and whose address is c2" cannot be supported.
We process this query in two tables: first, the number of people who buy product a2, with the answer 3, and second, the number of people with address c2, with the answer 3 as well. So, the final answer is 3, but the correct answer is 1. This means that privacy is provided by dividing the relation using the approach of Cormode et al., but query results are not always acceptable for the desired goals. Our motivation is to solve this problem. Meanwhile, we want to model the database as a graph instead of a tabular model to gain more utility with acceptable privacy.

Table I. A sample relation with four attributes.

Illness   Address   Date of birth   Product
d1        c1        b1              a1
d1        c2        b1              a1
d3        c2        b1              a2
d2        c1        b2              a2
d3        c2        b3              a3
d4        c1        b4              a2

In this paper, we propose a utility-preserving approach for data anonymization with ternary (or more) constraints based on graph anonymization techniques instead of tabular techniques. While satisfying the privacy concerns, this approach prevents linking and reidentification attacks and yields a small amount of information loss. The database is modeled as a hypergraph and tuples as hyperedges. Anonymization is carried out in two steps: the first step is hypergraph clustering (t clusters), and the second step is cluster anonymization (k-anonymization) using a local-recoding method. Our method is more flexible than other reported research on data publishing scenarios based on k-anonymization, such as in [20–22].

The remainder of this paper is organized as follows. Section 2 presents the preliminary concepts including definitions and privacy models. Section 3 describes our proposed approach. In Section 4, we analyze the information loss and distortion ratio of the approach. Section 5 focuses on processing aggregate queries on the anonymized database. The experimental evaluation is given in Section 6.
Finally, the paper is concluded with the advantages of the approach in the last section.

2. PRELIMINARIES

In this paper, without loss of generality, ternary constraints are considered to develop the proposed method. The method can be generalized to privacy constraints on an arbitrary number of attributes. Also, the database is static, and the goal is to answer aggregate queries with aggregation functions such as COUNT and SUM. Some basic definitions are presented next.

Definition 1. Hypergraph model: A hypergraph is the generalization of a graph wherein edges, called hyperedges, can connect more than two vertices. Formally, a hypergraph consists of a set of vertices V and a set of hyperedges E among those vertices. A hyperedge $e_j \in E$ is a subset of the vertices in V. Graphs are special instances of hypergraphs where each hyperedge has exactly two vertices [23].

Definition 2. Quasi-identifier attribute set (QID): A quasi-identifier attribute set is a set of attributes in a relation that potentially reveals private information, possibly by joining with other tables [2]. Equivalently, a quasi-identifier is a set of attributes that can be linked with external information with the purpose of reidentifying the individuals to whom the information refers. In Example 1, the attributes {product, DoB, address} are quasi-identifiers. In our scenario, QID is not empty.

Definition 3. Sensitive tuples: Tuples that may contain private values in an attribute of the non-QID set are called sensitive tuples.

Definition 4. Equivalence class: An equivalence class of a relation with respect to an attribute set is the set of all tuples in the table containing identical values for the attribute set.

Definition 5. k-anonymity property: A relation is k-anonymous if the size of every equivalence class with respect to the quasi-identifier attribute set is k or more. In other words, k-anonymity requires that every combination of values occurring in the quasi-identifier attribute set has a frequency of at least k [2].

Definition 6. k-anonymization: A view of a table is said to be k-anonymous if the view satisfies the k-anonymity property with respect to the quasi-identifiers [2].

Definition 7. The t-means-clustering algorithm: The t-means-clustering algorithm was developed by J. MacQueen [21] and also by J. A. Hartigan and M. A. Wong [24]. This algorithm classifies objects, based on their attributes/features, into t groups, where t is a positive integer. The grouping is carried out by minimizing the sum of squares of distances between the data and the corresponding cluster centroid. In the first step of t-means clustering, the number of clusters, t, is determined. Then, any t random objects can be chosen as initial centroids. Another method is to use the first t objects in the sequence as initial centroids [24,25]. The t-means algorithm is described in Figure 1.

3. OUR APPROACH

In general, it is desirable that the released data preserve data utility as much as possible. To achieve this goal, we design a utility-preserving anonymization method that yields a small amount of information loss while satisfying the k-anonymity property. In our approach, the database is modeled as a hypergraph, and the hypergraph is represented as an incidence matrix (V).
In this matrix, the rows indicate the values of the QID attributes, and the columns represent the associations of attributes, known as hyperedges. In this paper, we consider a QID with only three attributes, but it can be extended to more. As a formal definition, we have

$$ V_{m \times n} = \begin{Bmatrix} v_{11} & v_{12} & \cdots & v_{1n} \\ v_{21} & v_{22} & \cdots & v_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ v_{m1} & v_{m2} & \cdots & v_{mn} \end{Bmatrix} $$

So, we can define $V_{m \times n}$ as follows:

$$ \begin{aligned} & m: \text{the sum of the counts of different values for each attribute in QID} \\ & n: \text{the number of tuples} \\ & \forall i \in \{1, \ldots, n\}: \text{if } a_i, b_i, c_i \in \{1, \ldots, m\} \text{ then } e_i \in E,\; V_{a_i, e_i} = 1,\; V_{b_i, e_i} = 1,\; V_{c_i, e_i} = 1 \end{aligned} \tag{2} $$

It should be noted that the number of rows depends on the number of different values in each domain, and that number could be very large, even larger than the number of tuples. So, the space occupied by the matrix can be high when we have m rows with p attributes in QID. However, incidence matrices contain many elements with the value 0, so our matrix is sparse, and many algorithms exist to store a sparse matrix in optimum space [26,27]. In our approach, we did not use optimum sparse matrix storage, but in future work, this problem can be addressed with the algorithms mentioned in [26,27].

Example 2. Table II shows a relation of four attributes, namely, postcode, gender, DoB, and illness, and Table III shows the corresponding hypergraph representation of Table II. In Table III, we add one extra row to mark sensitive tuples in the corresponding hypergraph. Every column of Table III represents one tuple, in which the elements related to the values of that tuple are set to 1.

Our contribution is twofold. First, we introduce an algorithm for clustering the hypergraph such that the resulting intracluster similarity is high. Second, we propose an algorithm to anonymize each cluster with minimum information loss. These two steps lead to more flexibility and less information loss with respect to utility.
In the next two subsections, the method is described based on clustering and anonymization.

Figure 1. The flowchart of the t-means clustering algorithm.

Table II. A sample relation.

Postcode   Gender   Date of birth   Illness
4350       Male     11/07/1987      Flu
4350       Male     05/02/1990      HIV
4357       Female   09/10/1985      HIV
4357       Male     11/07/1987      Fever
4357       Male     04/12/1995      HIV
4340       Female   09/10/1985      Fever
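As a rough illustration of definition (2), the following Python sketch builds the incidence matrix for the six tuples of Table II. The function name `build_incidence_matrix`, the row ordering (postcodes, then genders, then dates of birth), and the choice of HIV as the sensitive value are assumptions made only for this example; the paper does not specify them.

```python
# Rows of Table II as (postcode, gender, date of birth, illness).
TABLE_II = [
    ("4350", "Male", "11/07/1987", "Flu"),
    ("4350", "Male", "05/02/1990", "HIV"),
    ("4357", "Female", "09/10/1985", "HIV"),
    ("4357", "Male", "11/07/1987", "Fever"),
    ("4357", "Male", "04/12/1995", "HIV"),
    ("4340", "Female", "09/10/1985", "Fever"),
]

QID_SIZE = 3  # postcode, gender, date of birth


def build_incidence_matrix(tuples, sensitive_values):
    """Return (row_labels, matrix, sensitive_row) for a list of tuples.

    One row per distinct QID value, one column per tuple (hyperedge).
    An entry is 1 when the tuple contains that value, as in definition (2).
    """
    # Collect the distinct values of each QID attribute, in order of appearance.
    row_labels = []
    for attr in range(QID_SIZE):
        seen = []
        for tup in tuples:
            if tup[attr] not in seen:
                seen.append(tup[attr])
        row_labels.extend((attr, value) for value in seen)

    row_index = {label: r for r, label in enumerate(row_labels)}
    matrix = [[0] * len(tuples) for _ in row_labels]
    for col, tup in enumerate(tuples):
        for attr in range(QID_SIZE):
            matrix[row_index[(attr, tup[attr])]][col] = 1

    # Extra row marking sensitive tuples (assumption: HIV is sensitive).
    sensitive_row = [1 if tup[3] in sensitive_values else 0 for tup in tuples]
    return row_labels, matrix, sensitive_row


row_labels, matrix, sensitive_row = build_incidence_matrix(TABLE_II, {"HIV"})
column_sums = [sum(row[col] for row in matrix) for col in range(len(TABLE_II))]
```

Each column contains exactly three ones, one per QID attribute, which is why the matrix becomes sparse as the number of distinct values grows.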

Table III. Hypergraph representation of QID of Table II (columns: hyperedges e1 to e6; rows: the postcode values, Female, Male, the four dates of birth, and a Sensitive row).

3.1. Clustering step

Some previous anonymization techniques use a greedy approach for anonymization [2,21]. Instead, we focus on a clustering-based approach. We partition the hyperedges into groups with similar vertices and then anonymize the vertices within each group. This makes the hypergraph anonymization benefit from the clustering information. In this section, the bounded t-means clustering algorithm is used. The classic clustering technique is known as the k-means algorithm. To avoid confusion with the k in k-anonymity, we use t-means throughout the paper.

Many clustering algorithms need a cluster center, usually a point in the domain of the dataset. One idea is that the center of a cluster is one of the nodes in that cluster. With this idea, we choose sensitive tuples as samples and pick t of them randomly as cluster centers. The samples are then grouped based on the cluster centers. Therefore, sensitive tuples are distributed among different clusters, which provides diversity in grouping and leads to more privacy. However, this may yield poor results if the centers do not provide an accurate representation of the cluster. For example, when two center points are close to each other, it is better to put them in one cluster than in different groups. To tackle this problem, we work with virtual centers, which do not need to be real nodes of the graph. Therefore, the first step is to find the virtual centers by grouping the relation into clusters. A sampling procedure is used to determine the virtual centers (step 1.1). Using the specified virtual centers, we group the entire database into clusters (step 1.2) by minimizing the sum of squares of distances between the data and the corresponding virtual centers.
The last step is to anonymize each database cluster. Our clustering method is shown in Algorithm 1.

Step 1.1: finding virtual centers. In this step, we want to find the best center point for each cluster. To this end, two main steps are performed: sampling and clustering on the samples.

Sampling. To deal with large datasets, instead of clustering the whole database at once, which is computationally demanding, a sample of the preprocessed database is picked deliberately. The samples in our approach are the sensitive tuples. In this step, the tuples of the database are divided into two parts: sensitive and non-sensitive tuples. Sensitive tuples are tuples that have a sensitive value in an attribute of the non-QID set. Because the goal of secure publishing is answering aggregate queries, the attributes of QID are not sensitive. Our approach assumes that data are published only when they are not sensitive on their own. For example, if DNA or income is a sensitive attribute, none of its values may be published. In other words, data publishing relies upon anonymization techniques for data values that are not intrinsically sensitive but whose association with other values reveals private information about an entity.

Cluster on samples. We apply clustering on the samples to find t center points. The t-means algorithm takes an input parameter t and partitions a set of n objects into t clusters so that the resulting intracluster similarity is maximized. The t-means algorithm is applied to the samples to cluster the sensitive tuples into t groups based on their similarities. In (3), sim(e_i, e_j) is the distance between e_i and e_j, where e_i and e_j are tuples and d_i and d_j represent their vectors (p is the size of QID, i.e., p = 3).

$$ \mathrm{sim}(e_i, e_j) = \sum_{k=1}^{p} (d_{ik} - d_{jk})^2 \tag{3} $$

After clustering the sample tuples into t groups, the center points of the clusters are computed. We adopt the bounded t-means algorithm in [14] to specify the virtual centers.
In particular, consider a cluster of sensitive tuples $\{st_{v_1}, \ldots, st_{v_b}\}$. Each tuple is assigned a vector with m coordinates (one per QID value) over {0, 1}. Then, the virtual center is a vector $\{\hat{c}_1, \ldots, \hat{c}_m\}$ such that

$$ \hat{c}_j = \frac{\sum_{i=1}^{b} st_{v_{ij}}}{b}, \quad 1 \le j \le m \tag{4} $$

In (4), coordinate j of the virtual center is computed as $\hat{c}_j$, where b is the number of tuples in the particular cluster.

Step 1.2: clustering the entire database. Let $\hat{C}_1, \ldots, \hat{C}_t$ be the t cluster centers resulting from step 1.1. Now, we partition all tuples of the database into t clusters, where cluster $C_i$ contains all tuples whose closest center point is $\hat{C}_i$. By grouping similar tuples into the same cluster with the sampling procedure, we obtain less information loss with high privacy, compared with the situation in which all the tuples are clustered at once without sampling. The evidence of this improvement is the size of the clusters with and without sampling. Clusters of balanced sizes facilitate data handling in the later anonymization step. Our experimental study shows that the sizes of clusters obtained with virtual centers have a less skewed distribution. For example, with t = 3, the size of clusters by real centers (without sampling) varies from 300 to , while with

virtual centers, it varies around to . More details of our experimental evaluation are explained in Section 7.

Example 3. Let t = 2 for clustering Table II. Table IV shows the two clusters that result from applying our clustering algorithm to Table II.

Algorithm 1: Clustering
Inputs:
  hg: Main database modeled as a hypergraph
  V: Hypergraph modeled as an incidence matrix
  t: Required number of clusters
  h: Number of all sensitive tuples
  e: Tuples modeled as hyperedges
  d: Hyperedges represented as vectors over {0,1}
  m: The size of all QID values
Outputs:
  b: The maximum size of each cluster
  Cluster[t]: t clusters with at most b tuples in each cluster
1- Function Make Incident Matrix(hg)
   { Make matrix V from hg based on (2)
     Add an extra row to show sensitive tuples // 1 for sensitive and 0 for non-sensitive tuples
     return matrix V }
2- Function Pick_sensitive_tuples(V)
   { Choose all sensitive tuples from V as samples
     return {e_1, ..., e_h} }
3- Choose t of the samples randomly: {e_1, ..., e_t}
4- Function Pre-Cluster_Samples({e_1, ..., e_t}) // t-means clustering algorithm
   { // e_i and e_j are tuples, and d_i and d_j represent their vectors
     Find similarities between every sample and the t clusters:
       sim(e_i, e_j) = sum over k = 1..p of (d_ik - d_jk)^2 // p is the size of QID
     Find the minimum of sim to find the best cluster for sample e_i
     // sensitive tuples in each cluster after clustering: {st_1, ..., st_b}
     return {st_1, ..., st_b} }
5- Function Calculate mean({st_v1, ..., st_vb}) // compute virtual center
   { c^_j = (sum over i = 1..b of st_vij) / b // b is the number of tuples in a particular cluster
     return virtual centers as vectors C^_1, ..., C^_t }
6- Cluster V based on C^_1, ..., C^_t // final clustering
7- Return Cluster[t]

Table IV. Hypergraph after clustering.

Cluster     Illness   Date of birth   Gender   Postcode
Cluster 1   Fever     09/10/1985      Female   4340
Cluster 1   HIV       04/12/1995      Female   4357
Cluster 2   Flu       11/07/1987      Male     4350
Cluster 2   HIV       05/02/1990      Male     4350
Cluster 2   HIV       05/02/1985      Male     4357
Cluster 2   Fever     11/07/1987      Male     4357

3.2. Anonymization step

In the second phase, we anonymize each cluster with a k-anonymity method. For this purpose, we use a local-recoding method [15]. Some methods anonymize a given database by mapping the values in the domain of the quasi-identifier attributes to modified values. This is known as global recoding. Alternatively, some methods modify individual instances of data items, which is known as local recoding. Two main local-recoding methods have been proposed in the literature. The first one, cell suppression, produces the anonymized database by suppressing individual cells of the original database [14,26]. The second one, cell generalization, maps individual cells to their generalized values using a hierarchy-based generalization model [18]. The global-recoding method causes too much distortion to a database. Therefore, we prefer a local-recoding method. However, optimal local recoding is an NP-hard (non-deterministic polynomial-time hard) problem, and good heuristic algorithms are required to achieve k-anonymization.

Here, we propose a local-recoding algorithm with cell generalization using a top-down approach. In this algorithm, a tree is constructed for each attribute in the QID set. At first, each equivalence class is generalized to the root of the tree. Then, the QID attributes of the tuples in an equivalence class are specialized in iterations, separately. During the specialization, k-anonymity should remain satisfied. The process continues until the tuples cannot be specialized any further. It should be noted that when we group the tuples into classes, if the size of a class is less than k, we add fake tuples to that class. Fake tuples are non-sensitive tuples that share their QID values with the equivalence class.

Example 4. Three hierarchical trees related to Example 3 are shown in Figure 2.
We show the anonymization procedure on cluster 1 for the QID attributes (postcode, DoB, and gender) step by step in Figures 3–5. Based on the generalization tree and the local-recoding method, the anonymization of Table I with k = 2 results in clusters 1 and 2, indicated in Table V.
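The top-down idea behind Example 4 can be sketched for a single attribute. The Python code below is only an illustration under simplifying assumptions: the postcode tree is taken to mask trailing digits (4350, 435*, 43**, 4***, ****), the whole cluster is specialized uniformly instead of per equivalence class, and no fake tuples are added. The function names `mask`, `is_k_anonymous`, and `specialize_postcodes` are hypothetical.

```python
from collections import Counter


def mask(postcode, level):
    """Generalize a postcode by replacing its last `level` digits with '*'."""
    keep = len(postcode) - level
    return postcode[:keep] + "*" * level


def is_k_anonymous(values, k):
    """Definition 5 for one attribute: every equivalence class has size >= k."""
    return all(count >= k for count in Counter(values).values())


def specialize_postcodes(postcodes, k):
    """Start at the root of the tree and specialize while k-anonymity holds.

    Returns the generalized postcodes and the remaining generalization height.
    """
    level = len(postcodes[0])  # root of the tree: every digit masked
    while level > 0:
        candidate = [mask(p, level - 1) for p in postcodes]
        if not is_k_anonymous(candidate, k):
            break  # one more specialization step would violate k-anonymity
        level -= 1
    return [mask(p, level) for p in postcodes], level


# Postcodes of the two clusters of Table IV, anonymized with k = 2.
cluster1, height1 = specialize_postcodes(["4340", "4357"], k=2)
cluster2, height2 = specialize_postcodes(["4350", "4350", "4357", "4357"], k=2)
```

Under these assumptions, the postcodes of cluster 1 stop at 43** while those of cluster 2 can be specialized back to their original values, which agrees with the postcode column of Table V.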

Algorithm 2: Hypergraph Anonymization
Inputs:
  hg: Main database modeled as a hypergraph
  V: Hypergraph modeled as an incidence matrix
  t: Required number of clusters
  k: The size of an equivalence class
  numt: Number of tuples in a cluster
Outputs:
  anon DB: Anonymized database
1- Function Cluster(hg, V, t) // Algorithm 1
   { return Cluster[t] }
2- Function AnonymizeCluster(Cluster[t], k)
   { // Group similar tuples in the same class
     num = 0;
     WHILE there is a class with a size of less than k or num <= numt DO
       Find a class with minimum size (MIN Class).
       Choose the best class (IND Class) to combine with MIN Class // by using the XOR method to find a class with the value of 0 in the result of XOR
       Calculate the new size of IND Class
       Remove MIN Class
       num = num + 1;
     END WHILE
     Add fake tuples if the size of an equivalence class is less than k in each cluster.
     Generalize each equivalence class with the local-recoding generalization method. }

In this paper, we propose a framework for anonymizing multirelation constraints (ternary or more) with hypergraph anonymization in data publishing. Our framework has two parts: (1) clustering and (2) anonymization. Our contribution lies mainly in the first part, and any anonymization algorithm can be used in the second part. We used an anonymization algorithm that is based on the local-recoding method and is similar to many anonymization approaches such as in [2,7,22], with changes in specialization and the addition of fake tuples. The difference between our algorithm and other tabular anonymization algorithms is that we deliberately use clustering as a preprocessing step. As a result, the greedy method of forming classes works faster in the anonymization step. The level of anonymization may not differ much in terms of the algorithm itself. However, overall, the result of the two steps is better than that of similar methods such as in [2,7,22].
The advantage of our method is utility preservation in finding equivalence classes, because similar tuples are grouped into the same clusters in the first phase. In other words, grouping similar tuples into the same equivalence class is much more probable in the proposed approach, compared with anonymization without clustering. Moreover, the time of constructing an anonymized table is much less than with the general k-anonymity method. Furthermore, we can choose different values of k for each cluster to reach more utility in the anonymization. This leads to less information loss. Algorithm 2 describes the overview of (t, k)-hypergraph anonymization.

Figure 2. Postcode, DoB, and gender generalization trees.

Figure 3. Postcode generalization steps for cluster 1 (Table I).

Figure 4. Gender generalization steps for cluster 1 (Table I).
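The clustering phase that Algorithm 2 relies on (steps 1.1 and 1.2 of Section 3.1) can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB implementation: it assumes the sensitive samples have already been grouped, so the iterative t-means refinement is omitted, it sums the squared differences of (3) over all coordinates of the 0/1 vectors, and the small vectors are invented toy data. The function names are hypothetical.

```python
def squared_distance(d_i, d_j):
    """Sum of squared coordinate differences between two vectors, as in (3)."""
    return sum((a - b) ** 2 for a, b in zip(d_i, d_j))


def virtual_center(vectors):
    """Coordinate-wise mean of the sensitive tuples in one group, as in (4)."""
    b = len(vectors)
    return [sum(column) / b for column in zip(*vectors)]


def assign_to_centers(vectors, centers):
    """Step 1.2: put every tuple into the cluster of its closest center."""
    clusters = [[] for _ in centers]
    for index, vector in enumerate(vectors):
        distances = [squared_distance(vector, center) for center in centers]
        clusters[distances.index(min(distances))].append(index)
    return clusters


# Toy 0/1 vectors for four tuples (hypothetical data).
all_vectors = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
# Assumed outcome of clustering the sensitive samples into t = 2 groups.
sample_groups = [[all_vectors[0], all_vectors[1]], [all_vectors[2]]]

centers = [virtual_center(group) for group in sample_groups]
clusters = assign_to_centers(all_vectors, centers)
```

The first virtual center, [1.0, 0.0, 0.5, 0.5], is not one of the tuples, which is the point of working with virtual instead of real centers.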

Figure 5. DoB generalization steps for cluster 1 (Table I).

Table V. Anonymized hypergraph.

Cluster     Illness   Gender   Postcode
Cluster 1   Fever     Female   43**
Cluster 1   HIV       Female   43**
Cluster 2   Flu       Male     4350
Cluster 2   HIV       Male     4350
Cluster 2   HIV       Male     4357
Cluster 2   Fever     Male     4357

4. FLEXIBILITY IN DETERMINING K FOR EACH CLUSTER

As explained in the anonymization step, different values of k in each cluster lead to more flexibility and utility. Now, the question is which value of k in each cluster results in less information loss. Suppose we have another parameter β related to privacy. This parameter requires that the number of sensitive tuples in each equivalence class is not more than β% of all tuples in that class. For example, let the number of sensitive tuples and the anonymization parameter in cluster 1 be $h_1$ and $k_1$, respectively, and let $NoE_1$ be the number of equivalence classes (NoE) in that cluster. The maximum number of sensitive tuples in each equivalence class is then bounded by $h_1 / NoE_1 \le \beta k_1$. By defining NoE and β initially, k can be found in every cluster recursively. This process results in less information loss and more utility because of the flexibility in determining k in each cluster. It should be noted that we present $h_1$ under the condition that the number of tuples in each equivalence class is lower than k. When a class has more than k tuples, we can find h for that class separately.

5. INFORMATION LOSS

In the proposed algorithm, there are two kinds of information loss: information loss due to generalization (GIL) and information loss due to adding fake tuples (FIL). Therefore, the total information loss consists of both GIL and FIL, computed individually.

5.1. GIL

The distortion value is equal to the distortion of the generalized dataset divided by the distortion of the fully generalized dataset, where the fully generalized dataset is the one with all attribute values generalized to the root of the tree. So, GIL is computed by (5).
$$ \mathrm{GIL} = \frac{\sum_{i=1}^{t} \sum_{j=1}^{b_i} \sum_{k=1}^{p} \dfrac{AGN_{i,j,k}}{h_{agn_k}}}{pn} \tag{5} $$

$AGN_{i,j,k}$ is the generalization height of the kth QID attribute in tuple j of cluster i, and $h_{agn_k}$ is the height of the generalization tree for the kth QID attribute. In the previous example, GIL is computed as

$$ \mathrm{GIL}_{(example)} = \frac{\sum_{i=1}^{t} \sum_{j=1}^{b_i} \left( \dfrac{PGN_{i,j}}{h_{pgn}} + \dfrac{DoGN_{i,j}}{h_{dogn}} + \dfrac{GeGN_{i,j}}{h_{gegn}} \right)}{pn} \tag{6} $$

$PGN_{i,j}$ is the generalization height of postcode, $DoGN_{i,j}$ is the generalization height of DoB, and $GeGN_{i,j}$ is the generalization height of gender in tuple j of cluster i. n is the number of tuples, k is the anonymization factor, and $b_i$ is the number of tuples in cluster i. $h_{pgn}$, $h_{dogn}$, and $h_{gegn}$ are the heights of the generalization trees for the QID attributes, and p is the size of the QID set. In our example, $h_{pgn} = 4$, $h_{dogn} = 7$, $h_{gegn} = 1$, and p = 3.

5.2. FIL

In the anonymization phase, fake tuples are added if the size of an equivalence class is less than k. This distortion value is equal to the number of all added fake tuples divided by the total number of tuples in the database. So, FIL is computed by (7).

$$ \mathrm{FIL} = \frac{\sum_{i=1}^{t} \sum_{j=1}^{u_i} FTN_{i,j}}{n} \tag{7} $$

where $FTN_{i,j}$ is the number of fake tuples in equivalence class j of cluster i and $u_i$ is the number of equivalence classes in cluster i. The total information loss (TIL) is computed by (8), accordingly.

$$ \mathrm{TIL} = \mathrm{GIL} + \mathrm{FIL} \tag{8} $$
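A small numeric sketch may help in reading (5), (7), and (8). The Python code below uses the tree heights of the example (4, 7, and 1), but the generalization heights of the individual tuples and the fake-tuple counts are assumptions chosen for illustration: only the postcodes of the two tuples in cluster 1 are taken to be generalized by two levels, and no fake tuples are added. The function names `gil` and `fil` are hypothetical.

```python
def gil(heights_per_cluster, tree_heights):
    """Equation (5): heights_per_cluster[i][j][k] plays the role of AGN_{i,j,k},
    and tree_heights[k] the role of h_agn_k."""
    p = len(tree_heights)
    n = sum(len(cluster) for cluster in heights_per_cluster)
    total = sum(
        agn / height
        for cluster in heights_per_cluster
        for tup in cluster
        for agn, height in zip(tup, tree_heights)
    )
    return total / (p * n)


def fil(fake_counts_per_cluster, n):
    """Equation (7): fake_counts_per_cluster[i][j] plays the role of FTN_{i,j}."""
    return sum(sum(counts) for counts in fake_counts_per_cluster) / n


# Tree heights from the example: postcode, DoB, gender.
TREE_HEIGHTS = (4, 7, 1)

# Assumed generalization heights (postcode, DoB, gender) for every tuple.
heights = [
    [(2, 0, 0), (2, 0, 0)],  # cluster 1: postcodes generalized to 43**
    [(0, 0, 0)] * 4,         # cluster 2: nothing generalized
]
# Assumed fake-tuple counts per equivalence class: none were needed.
fake_counts = [[0], [0, 0]]

example_gil = gil(heights, TREE_HEIGHTS)
example_fil = fil(fake_counts, n=6)
example_til = example_gil + example_fil  # equation (8)
```

With these assumed inputs, the numerator of (5) is 2/4 + 2/4 = 1 and pn = 18, so GIL = 1/18, FIL = 0, and TIL = 1/18.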

If it is desired to evaluate information loss as a rating parameter, we should compute the maximum value of TIL and compute the distortion ratio based on information loss. For the maximum value of GIL, we have the total generalization of all tuples (9). When each tuple is placed in a separate equivalence class, the maximum value for FIL is obtained, and k − 1 fake tuples should be added for each tuple (10). Equation (11) shows the maximum value of information loss.

$$ \mathrm{Max\,GIL} = \frac{pn}{n} = p \tag{9} $$

$$ \mathrm{Max\,FIL} = \frac{n(k-1)}{n + n(k-1)} = \frac{k-1}{k} \tag{10} $$

$$ \mathrm{Max\,TIL} = \mathrm{Max\,GIL} + \mathrm{Max\,FIL} = p + \frac{k-1}{k} \tag{11} $$

Finally, we have

$$ 0 \le \mathrm{TIL} \le p + \frac{k-1}{k} \tag{12} $$

The distortion ratio is obtained by (13).

$$ \text{Distortion Ratio (DR)} = \frac{\mathrm{TIL}}{\mathrm{Max\,TIL}} = \frac{\dfrac{\sum_{i=1}^{t} \sum_{j=1}^{b_i} \sum_{k=1}^{p} \dfrac{AGN_{i,j,k}}{h_{agn_k}}}{pn} + \dfrac{\sum_{i=1}^{t} \sum_{j=1}^{u_i} FTN_{i,j}}{n}}{p + \dfrac{k-1}{k}} \tag{13} $$

In our example, the distortion ratio is computed as (14).

$$ \mathrm{DR}_{(example)} = \frac{\mathrm{TIL}}{\mathrm{Max\,TIL}} = \frac{\dfrac{\sum_{i=1}^{t} \sum_{j=1}^{b_i} \left( \dfrac{PGN_{i,j}}{h_{pgn}} + \dfrac{DoGN_{i,j}}{h_{dogn}} + \dfrac{GeGN_{i,j}}{h_{gegn}} \right)}{np} + \dfrac{\sum_{i=1}^{t} \sum_{j=1}^{u_i} FTN_{i,j}}{n}}{p + \dfrac{k-1}{k}} \tag{14} $$

6. QUERY ANSWERING OF (T,K)-HYPERGRAPH ANONYMIZATION

Some previous approaches to anonymization work on generalization of tabular data, such as in [15,27]. Their framework is also based on the generalization of quasi-identifiers. However, none of these previous works discusses the impact on the accuracy of answering aggregate queries. In 2007, Zhang et al. proposed an approach for aggregate query answering on anonymized tables [28]. However, as in the majority of works on k-anonymity [19], this work focuses only on tabular methods, whereas our approach is based on graph algorithms, with lower information loss in addition to better efficiency in answering queries. The reason is that graph-based algorithms model data and relations as nodes and edges, which makes finding edges fast, and clustering the edges in the first step of our algorithm reduces information loss. In our view, it is desirable to answer aggregate queries correctly with minimum information loss, and graph-based algorithms are good solutions for this purpose.

In our approach, a query processor provides acceptable answers in response to users' queries. Because the main goal in data publishing is the aggregate analysis of data, the server responds only to aggregate requests. Every COUNT and SUM query related to the QID attributes is answered by investigating the vertices of the hypergraph that satisfy the desired conditions. For example, to answer the query "the number of people whose postcode is 4350 and who have the dangerous HIV disease," we should consider all edges in the set {4350, 435*, 43**, 4***, ****} that are connected to the HIV vertex. If the number of edges that satisfy the query condition is taken as the upper bound (15), the lower bound of the answer is computed by (16), considering the TIL parameter. The reason is that, because of the anonymization step, some edges matching the query condition may not be correct edges, and fake tuples may also have been added to the database. So, it is necessary to prune $E_{qry}$ using TIL. It should be noted that $E_{qry}$ is the number of edges that meet the requirements of the query.

$$ ans_{highbound} = E_{qry} \tag{15} $$

\[ ans_{lowbound} = E_{qry} - E_{qry} \left( \frac{\mathrm{TIL}}{p + \frac{k-1}{k}} \right) = E_{qry} \left( 1 - \frac{\mathrm{TIL}}{p + \frac{k-1}{k}} \right) \tag{16} \]

The final answer is bounded as in (17).

\[ ans_{lowbound} \le \text{final answer} \le ans_{highbound} \tag{17} \]

Our approach supports average (AVG) queries too, as demonstrated by Example 5.

Example 5. What is the average age of people who have the dangerous disease HIV?

For this query, we find the edges that connect to the HIV vertex and then the DoB vertices of those edges. Next, we compute the ages that correspond to each DoB. For example, if DoB is 11/07/1987, the age is 2013 − 1987 = 26, and if DoB is the generalized range 85-90, the age is 2013 − [(1985 + 1990)/2] ≈ 25. After computing the ages of all vertices in the answer set, we compute the sum of these ages. However, some of these ages might stem from fake tuples, so we compute minimum and maximum values for the average. As the upper bound, we take the average of all ages in the answer set. For the lower bound, we compute the minimum number of edges in the answer set according to Equation (16) and call this number min-num. We then select the min-num smallest age values in the answer set and compute their average as the lower bound.

For example, assume that the answer set of ages is [22, 23, 24, 24, 25, 26] and that ans_lowbound = 3 according to Equation (16). Also, ans_highbound = E_qry = 6. The maximum value of AVG is therefore (22 + 23 + 24 + 24 + 25 + 26)/6 = 24. The lower bound of AVG is computed from the three minimum values of the answer set, [22, 23, 24], which gives (22 + 23 + 24)/3 = 23. In the end, we have 23 ≤ AVG ≤ 24.

For MAX and MIN queries, we find the edges that satisfy the query condition. To do so, we return to the anonymized hypergraph and find the vertices that match the MAX or MIN condition. Then, the edges connected to those vertices are found. Finally, we use formula (17) to find the bounds of the answer.

7. EMPIRICAL STUDY

We implemented our algorithm using MATLAB on a system with a 2.4-GHz processor and 4 GB of RAM.
In our experiments, a dataset of records is generated as a health database with four attributes: postcode (four digits), DoB, gender, and illness. The algorithm is evaluated in terms of execution time and distortion ratio. We conducted the experiments for values of k from 2 to 7 with constant t = 3, both with and without clustering. As shown in Figures 6 and 7, with the clustering step, higher values of k result in more information loss and a higher distortion ratio. This variation is not significant, however. It can be observed in Figure 8 that the execution time of our algorithm with clustering is much lower than without clustering.

[Figure 6. Information loss with different values of k. x-axis: k (size of equivalence classes); y-axis: information loss.]

[Figure 7. Distortion ratio with different values of k. x-axis: k (size of equivalence classes); y-axis: distortion ratio; series: with clustering, without clustering.]

[Figure 8. Execution time with different values of k. x-axis: k (size of equivalence classes); y-axis: execution time (s); series: with clustering, without clustering.]

Our approach is implemented with and without sampling for t = 3 and t = 5. As shown in Figures 9 and 10, more clusters yield less information loss and consequently better utility. It has been demonstrated in [14], from experimental results, that the best value for t is √n. In Figures 9 and 10, we show the performance of our approach with and without sampling by type 2 and type 1, respectively.

[Figure 9. Information loss with different values of k. x-axis: k (size of equivalence classes); series: 3 clusters and 5 clusters, each for type 1 and type 2.]

[Figure 10. Distortion ratio with different values of k. x-axis: k (size of equivalence classes); y-axis: distortion ratio; series: 3 clusters and 5 clusters, each for type 1 and type 2.]

As demonstrated in the figures, different values of (t, k) over their possible range, bounded by (1, 1) and (n, n), give different results for privacy and utility. For example, the following four scenarios describe the privacy-utility trade-off.

Scenario 1: Setting t = 1 and k = 1 gives a (1, 1)-hypergraph anonymization. The hyperedge set of the anonymized hypergraph equals the original hyperedge set E, and the anonymized hypergraph exactly represents the original hypergraph HG. Thus, every query on the anonymized hypergraph is answered with the same precision as on HG. In this scenario, we have perfect utility but no privacy.

Scenario 2: Setting t = 1 and k = n gives a (1, n)-hypergraph anonymization. That is, a table with one cluster and an original hypergraph will be published. Recall that the anonymized hypergraph is then similar to the original k-anonymized table. In this scenario, we have limited utility in answering queries, while privacy is not preserved as much as when t > 1.

Scenario 3: Setting t = n and k = 1 gives an (n, 1)-hypergraph anonymization. The hyperedge set of the anonymized hypergraph equals the original hyperedge set E, and the anonymized hypergraph exactly represents the original hypergraph HG. In this scenario, we have perfect utility but no privacy.

Scenario 4: Setting t = n and k = n gives an (n, n)-hypergraph anonymization. That is, we publish a hypergraph with (n − 1)n hyperedges. In this scenario, we have perfect privacy but no utility.
Between these scenarios, there are many possibilities for trading off utility against privacy by choosing appropriate values of t and k. Generally, given a (t, k)-hypergraph anonymization, aggregate queries are answered approximately correctly with small k and large t. With increasing k and decreasing t, the accuracy of query answers is reduced, according to (7) and (15). Clearly, a (t, k)-hypergraph anonymization offers more utility and less privacy than a (t′, k′)-hypergraph anonymization when k < k′ and t′ < t.

8. CONCLUSION

In this paper, we considered the problem of anonymizing data in the form of a hypergraph for multirelations with more than two attributes. In particular, we investigated an effective clustering algorithm for hypergraph anonymization that finds similar vertices in a database modeled as a hypergraph. The novelty of the proposed method, compared with existing clustering-based anonymization methods, is that it computes and uses pairwise similarity values for clustering hyperedges. We also provided flexibility in determining k separately for each cluster, to balance information loss and privacy. Choosing sensitive tuples for sampling provides t-diversity among clusters. Distributing sensitive tuples into t clusters therefore results in high privacy in our approach. The experiments confirm that the proposed approach offers a strong trade-off between privacy and utility.

While we obtain t-diversity among clusters, it is also important to obtain l-diversity within each cluster, where l is the number of sensitive tuples in each equivalence class. Our approach does not yet support this. In future work, we plan to extend the algorithm with l-diversity and t-closeness to reach more privacy with acceptable utility in comparison with approaches such as [27]. Our approach also yields a significant speed-up in the run-time of the clustering step, especially on large databases.
With the aim of improving performance, we plan to use more suitable methods of clustering such as in [29,30].

REFERENCES

1. Riboni D, Pareschi L, Bettini C. JS-reduce: defending your data from sequential background knowledge attacks. IEEE Transactions on Dependable and Secure Computing 2012; 8(3).
2. Sweeney L. K-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems 2002; 10.
3. Alto P, Golle Ph. Revisiting the uniqueness of simple demographics in the US, in Proceedings of the 5th ACM workshop on Privacy in electronic society, New York, 2006.
4. Kiernan J, Srikant R, Xu Y, Agrawal R. Order preserving encryption for numeric data, in Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France, June.
5. Waters B, Boneh D. Conjunctive, subset, and range queries on encrypted data, in Proceedings of the 4th conference on Theory of cryptography, 2007.
6. Damiani E, De Capitani Di Vimercati S, Jajodia S, Paraboschi S, Samarati P, Ceselli A. Modeling and assessing inference exposure in encrypted databases. ACM Transactions on Information and System Security (TISSEC) 2005; 8(1).
7. Samarati P. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering 2001; 13(6).
8. Foresti S, Jajodia S, Paraboschi S, Samarati P, Vimercati S. Fragments and loose associations: respecting privacy in data publishing. Very Large Data Bases Endowment (VLDB) 2010; 3.
9. De Capitani di Vimercati S, Foresti S, Jajodia S, Paraboschi S, Samarati P, Ciriani V. Fragmentation design for efficient query execution over sensitive distributed databases, in Proceedings of the 29th IEEE International Conference on Distributed Computing Systems, Washington, DC, USA, 2009.
10. Agrawal D, Aggarwal C. On the design and quantification of privacy preserving data mining algorithms, in Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2001.
11. Datta S, Wang Q, Sivakumar K, Kargupta H. On the privacy preserving properties of random data perturbation techniques, in Third IEEE International Conference on Data mining, 2003.
12. Rokach L, Elovici Y, Shapira B, Kisilevich S. Efficient multidimensional suppression for k-anonymity. IEEE Transactions on Knowledge and Data Engineering 2010; 22.
13. Cormode G, Srivastava D, Yu T, Zhang Q. Anonymizing bipartite graph data using safe groupings, in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), vol. 1, Auckland, New Zealand, August 2008.
14. Thompson B, Yao D. Union-split clustering algorithm and social network anonymization, in Proceedings of the 4th International Symposium on Information, Computer, and Communications Security, New York, USA, 2009.
15. LeFevre K, DeWitt DJ, Ramakrishnan R. Incognito: efficient full-domain k-anonymity, in SIGMOD Conference, 2005.
16. Kifer D, LeFevre K, Machanavajjhala A, Chung Chen B. Privacy-preserving data publishing. Foundations and Trends in Databases (ACM) 2009; 2(1-2).
17. Fung BCM, Mohammed N, Desai BC, Wang K, Chen R. Privacy-preserving trajectory data publishing by local suppression. Information Sciences 2013; 231.
18. Srikant R, Agrawal R. Mining generalized association rules, in Proceedings of the 21st International Conference on Very Large Data Bases, San Francisco, CA, USA, August 1995.
19. Eltabakh MY, Padma J, Silva YN, Pei He WG, Aref EB. Query processing with K-anonymity. International Journal of Data Engineering (IJDE) 2012; 3(2).
20. Fung BCM, Wang K, Yu PS. Top-down specialization for information and privacy preservation, in The 21st International Conference on Data Engineering (ICDE), 2005.
21. Macqueen JB. Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium on Math, Statistics and Probability, vol. 1, 1967.
22. Aggarwal G, Feder T, Kenthapadi K, et al. Anonymizing tables, in 10th Int'l Conference on Database Theory, January.
23. Papa DA, Markov IL. Hypergraph partitioning and clustering, in Approximation Algorithms and Metaheuristics.
24. Hartigan JA, Wong MA. A K-means clustering algorithm. Journal of the Royal Statistical Society: Series C: Applied Statistics 1979; 28(1).
25. Chang CC, Thompson B, Wang H, Yao D. Towards publishing recommendation data with predictive anonymization, in Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security, 2010.
26. Agrawal R, Bayardo R. Data privacy through optimal k-anonymization, in Proceedings of the 21st International Conference on Data Engineering (ICDE), 2005.
27. Li T, Venkatasubramanian S, Li N. t-closeness: privacy beyond k-anonymity and l-diversity, in Proceedings of IEEE International Conference on Data Engineering.
28. Koudas N, Srivastava D, Yu T, Zhang Q. Aggregate query answering on anonymized tables, in IEEE 23rd International Conference on Data Engineering (ICDE), Istanbul, 2007.
29. Lai J-H, Huang D, Zheng W-S, Wang C-D. A support vector-based algorithm for clustering data streams. IEEE Transactions on Knowledge and Data Engineering 2013; 25(6).
30. Pradeepini G, Jyothi S. An improved k-means clustering algorithm with refined initial centroids. Publications Of Problems & Application in Engineering Research Paper 2013; 4(1).
