FMC: An Approach for Privacy Preserving OLAP


Ming Hua, Shouzhi Zhang, Wei Wang, Haofeng Zhou, Baile Shi
Fudan University, China
{minghua, shouzhi_zhang, weiwang, haofzhou,

Abstract. Preserving private information while still supporting thorough analysis is one of the significant issues in OLAP systems. One of the challenges is to prevent sensitive values from being inferred from more aggregated, non-sensitive data. This paper presents a novel algorithm, FMC, that eliminates the inference problem by hiding additional data besides the sensitive information itself, and proves that this additional information is both necessary and sufficient. The approach therefore provides as much information as possible to users while preserving security. The strategy does not affect the online performance of the OLAP system. Systematic analysis and experimental comparison are provided to show the effectiveness and feasibility of FMC.

1 Introduction

Online analytic processing (OLAP) is an important infrastructure for advanced data analysis and knowledge discovery. While most previous studies on OLAP focus on OLAP models, data cube and data warehouse construction, maintenance and compression, and efficient query answering methods, it is also critical to investigate the problem of privacy preservation in OLAP query answering.

Example 1 (Motivation). Consider a table about the patient cases in some hospitals, as shown in Table 1.

Table 1. A table about the patient cases

Hospital   Disease        Number of cases
Forest     Lung cancer    16
Forest     Diabetes       63
Memorial   Diabetes       87
Memorial   Heart attack   32

Suppose the hospitals do not want to make the number of cases of individual diseases public, but agree to share the total number of all cases in a hospital or the total number of cases of a certain disease in all hospitals. That is, in the data cube based on Table 1, the values of cells <f,l>, <f,d>, <m,d> and <m,h> should be hidden from users (as shown in Figure 1, where <f,l> stands for the cell <Forest, Lung cancer>, and similarly for the other cells).

Fig. 1. The data cubes based on Table 1. Visible aggregate cells: <f,*> 79, <m,*> 119, <*,l> 16, <*,d> 150, <*,h> 32, <*,*> 198. Hidden cells: <f,l>, <f,d>, <m,d>, <m,h>.

A simple and direct security policy is to decline all access to the sensitive cells. However, such a declining-direct-access policy is insufficient to preserve privacy. Since only part of the measure is hidden, the structure of the cube can still be recovered from the remaining columns of the fact table, so the sensitive values can be revealed through other unprotected cells. For example, the value of <f,l> is exactly the same as that of <*,l>, since <*,l> aggregates only this record. Moreover, subtracting the value of <f,l> from that of <f,*> discloses the value of <f,d>.

The problem now becomes: can we devise a better security policy so that privacy is strictly preserved? Moreover, we want such a policy to hide as little information as possible. We call this the privacy preserving OLAP problem.

In this paper, we tackle the problem by hiding a minimal set of unprotected cells involved in determining the values of confidential cells, so that the precondition of information leakage no longer holds. For example, if we hide the cells <*,l> and <*,h> in Figure 1, the values of the sensitive cells <f,l>, <f,d>, <m,d> and <m,h> can never be obtained by accessing only the remaining unprotected cells.

Compared to the privacy control problems in statistical databases and data mining, the privacy preserving OLAP problem poses several new challenges, and we make the following contributions.

1) Sensitive data items can be distributed at different granularity levels in OLAP. We propose a general model and solution that can handle this case.
2) It is crucial for OLAP systems to provide users with as much information as possible while protecting the sensitive data. We prove that our algorithm hides only the necessary data.
3) OLAP applications usually require short response time.
We eliminate the inference before users interact with the system, so the algorithm does not affect the online performance of the OLAP system.

The rest of the paper is organized as follows. In Section 2, we formulate the problem of privacy preserving OLAP. Section 3 provides an overview of the solution. The key techniques are discussed in Section 4 and Section 5. Extensive experimental results are reported in Section 6. Finally, we draw the conclusion in Section 7.

1.1 Related Work

Inference control methods in statistical databases are classified into two categories [1]. Restriction based techniques include auditing all queries [2], suppressing sensitive data [3] and so on. Perturbation based techniques include adding noise to the source or the outputs to affect the precision of detail data [4]. Inference control for OLAP systems has received less attention. However, Lingyu Wang et al. have systematically studied this problem: [1] derives sufficient conditions for non-compromisability in sum-only data cubes; [5] discusses the inference problem caused by multi-dimensional range queries; [6] proposes a method to eliminate both unauthorized accesses and malicious inferences.

2 Problem Definition

A data cube consists of a set of dimensions and measures with aggregate functions defined on it. In this paper, we mainly focus on the SUM function. Each node of the data cube is called a cuboid, and a tuple in a cuboid is called a cell. Two cuboids $C_1$ and $C_2$ follow the partial order (i.e., $C_1 \preceq C_2$) iff, on each dimension, either they share the same attribute or $C_2$ has a higher-level attribute in the dimension hierarchy. In this case, we say $C_2$ is an ancestor of $C_1$, and $C_1$ is a descendant of $C_2$. $C_2$ is a father of $C_1$, and correspondingly $C_1$ is a son of $C_2$, if $C_1 \preceq C_2$ and there is no other cuboid $C'$ such that $C_1 \preceq C'$ and $C' \preceq C_2$. These definitions apply to cells as well. In Example 1, for cuboids, <Hospital, Disease> $\preceq$ <Hospital, *>, and for cells, <f,l> $\preceq$ <f,*>.

Because of the multi-dimensional data model, access control in OLAP systems applies to cuboids and cells. We define the confidential information as a forbidden set of the form $\{c_1, \dots, c_m\}$, where $c_i$ is a cell of the data cube. We assume that the forbidden set includes all the confidential cells and their descendants, since a confidential cell could also be computed by simply aggregating all its descendants. All the cells not included in the forbidden set compose the available set, which is accessible to users. For example, the available set in Example 1 includes all the cells except <f,l>, <f,d>, <m,d> and <m,h>.

However, we have shown in Example 1 that some confidential information (such as <f,l> and <f,d>) can be obtained by combining cells in the available set. We define the available set, together with all the information derived from it, as the available set closure.

Definition 1 [Available Set Closure]. Given an available set $A$, the Available Set Closure $C(A)$ is defined as:
1. If cell $c \in A$, then $c \in C(A)$;
2. If cell $c \in C(A)$, then $k \cdot c \in C(A)$, where $k$ is a real number;
3. If cells $c_1, c_2 \in C(A)$, then $c_1 + c_2 \in C(A)$.

When the available set closure and the forbidden set intersect, inference occurs. In this case, we also say that the forbidden set is compromised. The cells in the available set that cause the inference are called the source of the inference.

Definition 2 [Compromisability]. Given a data cube $L$ and a forbidden set $F$ in $L$, $F$ is compromised when $C(L - F) \cap F \neq \emptyset$.

To prevent compromise, we hide some cells in the source, so that none of the sensitive cells can be computed from the incomplete source. However, the hidden cells may themselves be inferred from cells at higher granularity levels, so more cells must be hidden to protect them. Eventually we find a set of cells, in addition to the forbidden set, such that no cell outside these sets causes inference to the cells inside.
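Definitions 1 and 2 can be checked mechanically once each cell is written as a 0/1 vector over the base cells (the vector code that Section 4 introduces): the closure is then the linear span of the available vectors, and a forbidden cell is compromised when its vector lies in that span. The following is a minimal sketch under that reading, using the cube of Example 1. The helper names `rank` and `in_closure` are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of equal-length vectors (exact Gaussian elimination)."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def in_closure(available, target):
    """True if `target` is a linear combination of the `available` vectors."""
    return rank(available + [target]) == rank(available)

# Vectors over the base cells, in the order <f,l>, <f,d>, <m,d>, <m,h>.
available = {
    "<f,*>": [1, 1, 0, 0], "<m,*>": [0, 0, 1, 1],
    "<*,l>": [1, 0, 0, 0], "<*,d>": [0, 1, 1, 0],
    "<*,h>": [0, 0, 0, 1], "<*,*>": [1, 1, 1, 1],
}
forbidden = {
    "<f,l>": [1, 0, 0, 0], "<f,d>": [0, 1, 0, 0],
    "<m,d>": [0, 0, 1, 0], "<m,h>": [0, 0, 0, 1],
}

# Forbidden cells whose value can be derived from the available set.
compromised = [name for name, vec in forbidden.items()
               if in_closure(list(available.values()), vec)]
print(compromised)
```

On this input all four forbidden cells are reported, which matches the observation in Example 1 that denying direct access alone does not protect them.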

Definition 3 [Minimal Cover (MC)]. Given a data cube $L$ and a forbidden set $F$ in $L$, a set $S$ is defined as the Minimal Cover of $F$ (represented as $MC(F)$) if:
1. $S \subseteq L - F$;
2. $C(L - F - S) \cap (F + S) = \emptyset$;
3. $\forall S' \subset S$, $C(L - F - S') \cap (F + S') \neq \emptyset$.

The minimal cover is a subset of the available set. The second condition requires that, after hiding the minimal cover, the remaining cells cause no inference to either the minimal cover or the forbidden set. The third condition states that no proper subset of the minimal cover satisfies the second condition, which guarantees that every cell in the minimal cover is indispensable for eliminating the inference.

Problem Statement. Given a data cube $L$ and a forbidden set $F$, the privacy preserving OLAP problem is to find a minimal cover $MC(F)$ of $F$, which prevents $F$ from being compromised while hiding as little information as possible.

3 Overview of Privacy Preserving OLAP Procedure

From the definitions, it is clear that the minimal cover must be free of inference to both the forbidden set and itself; otherwise, one could disclose sensitive information by first inferring the values of the minimal cover and then deriving the forbidden set. A subset of the minimal cover that is only free of inference to the forbidden set is called the minimal partial cover. We take the following two steps: first find the minimal partial cover of the forbidden set, and then extend it to the minimal cover to guarantee security.

Step 1. Finding the minimal partial cover for the forbidden set. We find the minimal partial cover MPC of the forbidden set by linear system theory, such that hiding MPC eliminates all the inference directed at the forbidden set, whereas hiding only a proper subset of MPC does not.

Step 2. Extending the minimal partial cover to the minimal cover. We then take the MPC found in Step 1 as the new forbidden set, and repeat finding the minimal partial cover for the newly hidden cells until no more cells need to be hidden.
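The three conditions of Definition 3 can be tested by brute force on a small cube. The sketch below does this for Example 1, again treating the closure as a linear span over 0/1 cell vectors. It is only a checker for the definition under that assumption, not the FMC algorithm, and the names `is_cover` and `is_minimal_cover` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a list of equal-length vectors (exact Gaussian elimination)."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def in_span(vectors, target):
    return rank(vectors + [target]) == rank(vectors)

def is_cover(available, forbidden, hidden):
    """Condition 2: the remaining cells infer nothing in F + S."""
    remaining = [v for name, v in available.items() if name not in hidden]
    protected = list(forbidden.values()) + [available[n] for n in hidden]
    return not any(in_span(remaining, v) for v in protected)

def is_minimal_cover(available, forbidden, hidden):
    """Conditions 2 and 3: a cover, none of whose proper subsets is a cover."""
    if not is_cover(available, forbidden, hidden):
        return False
    names = sorted(hidden)
    return not any(is_cover(available, forbidden, set(sub))
                   for k in range(len(names))
                   for sub in combinations(names, k))

# Vectors over the base cells <f,l>, <f,d>, <m,d>, <m,h> of Example 1.
available = {
    "<f,*>": [1, 1, 0, 0], "<m,*>": [0, 0, 1, 1],
    "<*,l>": [1, 0, 0, 0], "<*,d>": [0, 1, 1, 0],
    "<*,h>": [0, 0, 0, 1], "<*,*>": [1, 1, 1, 1],
}
forbidden = {
    "<f,l>": [1, 0, 0, 0], "<f,d>": [0, 1, 0, 0],
    "<m,d>": [0, 0, 1, 0], "<m,h>": [0, 0, 0, 1],
}

print(is_minimal_cover(available, forbidden, {"<*,l>", "<*,h>"}))
```

Hiding <*,l> and <*,h> passes the check, while adding <*,d> to the hidden set still gives a cover but no longer a minimal one.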
4 Finding Minimal Partial Cover

In this section, we discuss how to find the minimal partial cover for a forbidden set. First, we define the vector code that represents each cell in a cuboid as follows.

Definition 5 [Vector Code]. Given a cuboid $C$, the vector code $c_v$ for a cell $c$ in $C$ or in one of $C$'s father cuboids is defined as $(a_1, \dots, a_n)$, where $n$ is the number of cells in $C$, and
- for $c \in C$: $a_i = 1$ if $c$ is the $i$-th cell in $C$, and $a_i = 0$ otherwise;
- for $c \in Father(C)$: $a_i = 1$ if $c$ aggregates the $i$-th cell in $C$, and $a_i = 0$ otherwise.
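As a small illustration of Definition 5, the sketch below computes vector codes for the cuboid <Hospital, Disease> of Example 1. Representing a cell as a tuple with `"*"` for an aggregated dimension is an assumption made here for illustration; with that encoding, one matching rule covers both cases of the definition.

```python
def vector_code(cell, base_cells):
    """0/1 vector code of `cell` over the ordered cells of the base cuboid.

    A component is 1 when the cell equals the base cell, or aggregates it,
    i.e. every non-"*" coordinate of the cell matches the base cell.
    """
    return [
        1 if all(c == "*" or c == b for c, b in zip(cell, base)) else 0
        for base in base_cells
    ]

# Cells of cuboid <Hospital, Disease>, in a fixed order.
base_cells = [("f", "l"), ("f", "d"), ("m", "d"), ("m", "h")]

print(vector_code(("f", "*"), base_cells))  # code of the father cell <f,*>
print(vector_code(("f", "l"), base_cells))  # code of the base cell <f,l>
```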

For example, in the cuboid <Hospital, Disease> in Figure 1, the vector code of cell <f,*> is (1,1,0,0), and the vector code of <f,l> is (1,0,0,0).

The cell corresponding to $c_v$ can be inferred from $c_1, \dots, c_n$ if the vector codes $c_{v_1}, \dots, c_{v_n}$ can be linearly combined into the vector code $c_v$. To determine whether this happens, we discuss the following three cases for the solutions of equation (1), where $x_1, \dots, x_n$ are real numbers:

$$x_1 c_{v_1} + \dots + x_n c_{v_n} = [c_{v_1}, \dots, c_{v_n}] \, [x_1, \dots, x_n]^T = c_v \qquad (1)$$

Equation (1) has no solution. The cell $c$ corresponding to $c_v$ cannot be computed from any other cells, so no additional information needs to be hidden.

Equation (1) has exactly one non-zero solution. $c_v$ can be computed from a certain combination of $c_{v_1}, \dots, c_{v_n}$. If $x_i, \dots, x_j$ are the non-zero components of the solution, then the corresponding cells $c_{v_i}, \dots, c_{v_j}$ are all indispensable for inferring $c_v$. Therefore, hiding just one of $c_{v_i}, \dots, c_{v_j}$ prevents the inference.

Equation (1) has more than one non-zero solution. To eliminate all the inference, we need to hide one cell whose corresponding component of the solution $X$ is always non-zero. If there is no such cell, we need to find a set of cells at least one of which is used in each solution.

4.1 An Example

Based on linear system theory [7], we develop a method to eliminate the inference to certain cells. The method is illustrated in the following example.

Example 2. We try to find the minimal partial cover for cell <f,d> in Example 1, with the same security requirements. Suppose $c_1$ = <f,*>, $c_2$ = <m,*>, $c_3$ = <*,l>, $c_4$ = <*,d>, and $c_5$ = <*,h>. The corresponding vector codes are $c_{v_1}, \dots, c_{v_5}$.

1. We construct the equation by setting $A = [c_{v_1}, \dots, c_{v_5}]$ and $b = c_v$ (the vector code of <f,d>):

$$AX = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix} X = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \qquad (2)$$

2. The solution of equation (2) is $X = X_0 + k X_1$, where $X_0 = [1, 0, -1, 0, 0]^T$, $X_1 = [-1, -1, 1, 1, 1]^T$, and $k$ is a real number. If the $i$-th component of $X$ is non-zero, then $c_{v_i}$ is used to compute <f,d>. For example, if we take $k = 0$, then $X = [1, 0, -1, 0, 0]^T$ (i.e., $c_v = c_{v_1} - c_{v_3}$), which is exactly the case depicted in Example 1.

3. We try to find a component of $X$ that is always non-zero, or a set of components at least one of which is non-zero in each $X$. If $k = 0$, then $X = X_0$, and only the first and third components are non-zero. If $k \neq 0$, the first and third components can be made zero by choosing $k = 1$, but the second, fourth and fifth components are never zero. Hence, a cell in $\{c_1, c_3\}$ together with a cell in $\{c_2, c_4, c_5\}$ forms the minimal partial cover of <f,d>. For example, if we hide $\{c_1, c_5\}$, <f,d> cannot be compromised.
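The solution family of Example 2 can be verified with plain arithmetic. The sketch below checks that $X = X_0 + kX_1$ reproduces the vector code of <f,d> for several values of $k$, and that every solution uses at least one of $c_1$ and $c_5$. The function names are illustrative assumptions.

```python
from fractions import Fraction

# Vector codes c_v1..c_v5 over the base cells <f,l>, <f,d>, <m,d>, <m,h>.
cols = [
    [1, 1, 0, 0],  # c1 = <f,*>
    [0, 0, 1, 1],  # c2 = <m,*>
    [1, 0, 0, 0],  # c3 = <*,l>
    [0, 1, 1, 0],  # c4 = <*,d>
    [0, 0, 0, 1],  # c5 = <*,h>
]
b = [0, 1, 0, 0]   # vector code of <f,d>
X0 = [1, 0, -1, 0, 0]
X1 = [-1, -1, 1, 1, 1]

def solution(k):
    """The solution X = X0 + k * X1 for a given real k."""
    return [a + k * c for a, c in zip(X0, X1)]

def combine(x):
    """Linear combination sum_j x[j] * cols[j]."""
    return [sum(x[j] * cols[j][i] for j in range(len(cols)))
            for i in range(len(b))]

def support(k):
    """1-based indices of the cells actually used by solution(k)."""
    return {i + 1 for i, v in enumerate(solution(k)) if v != 0}

for k in [Fraction(0), Fraction(1), Fraction(1, 2), Fraction(-3)]:
    assert combine(solution(k)) == b   # every member solves equation (2)
    assert support(k) & {1, 5}         # and uses c1 or c5

print(support(0), support(1))
```

With $k = 0$ only $c_1$ and $c_3$ are used, and with $k = 1$ only $c_2$, $c_4$ and $c_5$, so hiding cells from just one of the two groups would leave an inference path open.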

Input: the forbidden set F and the cuboid C
Output: a minimal partial cover MPC of F
Method:
    construct the coefficient matrix A = [c_v1, ..., c_vn]
    for each cell c in F
        if Ax = c_v has solutions
            find the solutions X of Ax = c_v
            find the set of components M_c at least one of which is non-zero in each X
    return MPC = union of M_c over all c in F

Fig. 2. Algorithm 1 (FMPC: finding a minimal partial cover)

4.2 Algorithm

We now generalize the algorithm for finding the minimal partial cover (Figure 2). Given a forbidden set $F$ in cuboid $C$, first construct the coefficient matrix $A$ using the unprotected cells in $C$ or in $C$'s fathers. Then, for each cell $c$ in $F$, if $Ax = c$ has solutions, find the set of components at least one of which is non-zero in each solution $X$. We use linear system theory [7] to find such cells.

The solutions of $Ax = c$ can be represented as $X = X_0 + [X_1, \dots, X_r] \, [k_1, \dots, k_r]^T$, where $X_1, \dots, X_r$ are the basic solutions of $Ax = 0$, and $X_0$ is a particular solution of $Ax = c$. There are $r$ independent components in $X$, each taking the value 0 in $X_0$ and the value 1 in exactly one $X_i$ ($i = 1, \dots, r$). For example, in Figure 3, the last three components are independent. Suppose $X_0[i]$ and $X_2[i]$ are the non-zero entries among the $i$-th components of $X_0$ to $X_3$, and $X_2[j]$ is the independent component taking the value 1 in $X_2$. Then either $X[i]$ or $X[j]$ is non-zero in every $X$, and the corresponding cells are the minimal partial cover.

Fig. 3. An example of minimal partial cover ($X = X_0 + k_1 X_1 + k_2 X_2 + k_3 X_3$, with the independent components marked)

Lemma 1. Given $X = X_0 + [X_1, \dots, X_{n-r}] \, [k_1, \dots, k_{n-r}]^T$, the $(r+1)$-th to $n$-th components of $X$ are the independent components. If $X_0[i] \neq 0$, and only $X_{d_1}[i], \dots, X_{d_j}[i]$ among $X_1[i], \dots, X_{n-r}[i]$ are non-zero ($d_1, \dots, d_j \in \{1, \dots, n-r\}$ and $i < r+1$), then:
1. At least one of the components $X[i], X[r+d_1], \dots, X[r+d_j]$ in $X$ is non-zero.
2. The components in any proper subset of $X[i], X[r+d_1], \dots, X[r+d_j]$ can all be zero in $X$.

Lemma 2. Algorithm 1 returns a minimal partial cover of the forbidden set FS.

(The proofs of Lemma 1 and Lemma 2 are not provided here due to space limitations.)
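The solution form used above can be produced by row-reducing the augmented matrix. The sketch below computes a particular solution, a basis of the homogeneous solutions, and the independent (free) components, then picks a set of components in the way Lemma 1 describes. It is one possible reading of the lemma, not the authors' implementation, and `general_solution` and `blocking_set` are hypothetical names.

```python
from fractions import Fraction

def general_solution(cols, b):
    """Solve sum_j x[j] * cols[j] = b exactly.

    Returns (x0, basis, free): a particular solution, one homogeneous
    basis vector per free column, and the free column indices.
    Returns None when the system has no solution.
    """
    n, m = len(cols), len(b)
    M = [[Fraction(cols[j][i]) for j in range(n)] + [Fraction(b[i])]
         for i in range(m)]
    pivots, r = [], 0
    for c in range(n):
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        pv = M[r][c]
        M[r] = [v / pv for v in M[r]]
        for i in range(m):
            if i != r and M[i][c] != 0:
                f = M[i][c]
                M[i] = [a - f * q for a, q in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    if any(all(v == 0 for v in row[:n]) and row[n] != 0 for row in M):
        return None
    free = [c for c in range(n) if c not in pivots]
    x0 = [Fraction(0)] * n
    for row, c in zip(M, pivots):
        x0[c] = row[n]
    basis = []
    for f in free:
        v = [Fraction(0)] * n
        v[f] = Fraction(1)
        for row, c in zip(M, pivots):
            v[c] = -row[f]
        basis.append(v)
    return x0, basis, free

def blocking_set(x0, basis, free):
    """A component i with x0[i] != 0, plus the free components whose
    basis vector is non-zero at i (0-based indices)."""
    i = next(j for j, v in enumerate(x0) if v != 0)
    return [i] + [f for f, vec in zip(free, basis) if vec[i] != 0]

# Example 2: columns c1..c5 and the vector code of <f,d>.
cols = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
x0, basis, free = general_solution(cols, [0, 1, 0, 0])
print(x0, basis, free, blocking_set(x0, basis, free))
```

On Example 2 this reproduces $X_0$ and $X_1$ from the text and selects the 0-based components 0 and 4, that is, the cells $c_1$ and $c_5$.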

5 Extending the Minimal Partial Cover to Minimal Cover

In this section, we employ a level-wise framework, together with some optimizing strategies, to extend the minimal partial cover to the minimal cover across the cuboids of the cube.

5.1 Two Optimizing Strategies

Eliminating Single-son Inference. A cell is called a single-son cell if it has only one child in its son cuboid. All the single-son fathers of cells in the forbidden set are definitely sensitive. In Example 1, if we hide the two single-son cells <*,l> and <*,h>, all inferences are eliminated. Thus, in our algorithm we first add all the single-son fathers of the sensitive cells to the minimal cover. This both eliminates a large part of the inference and reduces the number of cells we must check for inference.

Finding Candidate Range. In Algorithm 1, we check all the fathers and unprotected siblings of the forbidden cells for inference. However, not all of them are dangerous.

Example 3. A two-dimensional cube is shown in Figure 4(a). The cell <a2,b1>, marked with * in the cuboid <A,B>, is sensitive. We construct the coefficient matrix A for cuboid <A,B> (as shown in Figure 4(b)). The column vectors of A correspond to the 8 father cells and the 5 unprotected cells in cuboid <A,B>. However, only the column vectors A[1], A[2], A[5], A[6], A[9] and A[10] can contribute to inferring the value of <a2,b1>, because the others have all zeros in the corresponding components. We call the sub-matrix formed by A[1], A[2], A[5], A[6], A[9], A[10] and their non-zero components the candidate range of the forbidden set (outlined with a dashed line in Figure 4(b)). The candidate range can be found by first setting it to the father cells of the forbidden set, and then iteratively adding the cells that intersect with the candidate range.
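The iterative construction of the candidate range in Example 3 can be sketched as a simple fixed-point expansion. Each unprotected cell is represented here by the set of base cells it covers, which is an encoding chosen for illustration. The father cell <*,b4> is assumed from the stated count of 8 father cells, and the function name is hypothetical.

```python
def candidate_range(cells, forbidden):
    """Names of the cells that can take part in inferring `forbidden`.

    cells: mapping from cell name to the set of base cells it covers.
    forbidden: set of sensitive base cells.
    """
    covered = set(forbidden)
    selected = set()
    changed = True
    while changed:
        changed = False
        for name, base in cells.items():
            if name not in selected and base & covered:
                selected.add(name)   # cell overlaps the current range
                covered |= base      # its base cells join the range
                changed = True
    return selected

# Cuboid <A,B> of Example 3; <a2,b1> is sensitive and therefore absent
# from the unprotected base cells.
cells = {
    "<a1,*>": {"a1b1", "a1b2"}, "<a2,*>": {"a2b1"},
    "<a3,*>": {"a3b3", "a3b4"}, "<a4,*>": {"a4b3"},
    "<*,b1>": {"a1b1", "a2b1"}, "<*,b2>": {"a1b2"},
    "<*,b3>": {"a3b3", "a4b3"}, "<*,b4>": {"a3b4"},  # <*,b4> is assumed
    "<a1,b1>": {"a1b1"}, "<a1,b2>": {"a1b2"},
    "<a3,b3>": {"a3b3"}, "<a3,b4>": {"a3b4"}, "<a4,b3>": {"a4b3"},
}

print(sorted(candidate_range(cells, {"a2b1"})))
```

The six selected cells correspond to the columns A[1], A[2], A[5], A[6], A[9] and A[10] named in Example 3, assuming the columns are ordered as the dictionary above.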
Fig. 4. A two-dimensional data cube. (a) A two-dimensional data cube, whose cuboid <A,B> contains the cells <a1,b1>, <a1,b2>, <a2,b1>*, <a3,b3>, <a3,b4> and <a4,b3>. (b) The coefficient matrix for cuboid <A,B>.

5.2 Algorithm

We use a level-wise framework to extend the minimal partial cover to the minimal cover. As shown in Algorithm 2 in Figure 5, we first rank the cuboids in the cube in ascending order of granularity level. Then, for each cuboid, we apply the two optimizing strategies and invoke Algorithm 1 to find the minimal partial cover of the forbidden set in this cuboid. The returned minimal partial cover must itself be checked for inference. This process is repeated until there is no new minimal partial cover in the current cuboid.

Input: the forbidden set FS
Output: a minimal cover MC of FS
Method:
    for each cuboid C* in the cube
        while FS ∩ C* ≠ ∅
            add single-son fathers to MC
            find the candidate range CR for FS
            m = FMPC(FS, CR)     // m is the minimal partial cover of FS returned by FMPC
            FS = FS − FS ∩ C*    // inference to FS ∩ C* has been eliminated
            MC = MC ∪ m
            FS = FS ∪ m          // the minimal partial cover should be protected
    return MC

Fig. 5. Algorithm 2 (FMC: a level-wise algorithm to find a minimal cover)

Theorem 1. Algorithm 2 returns a minimal cover of the forbidden set FS.

(The proof of Theorem 1 is based on Lemma 1 and Lemma 2, and is not provided here due to space limitations.)

6 Experimental Results

Implementation. All experiments are conducted on a Pentium 4 2.8 GHz PC with 512MB main memory, running Microsoft Windows XP Professional. The algorithm is implemented using Borland C++ Builder 6 with Microsoft SQL Server 2000.

Data Set. We used synthetic data sets and the TPC-H benchmark data set for our experiments. For the synthetic data sets, we generated data from a Zipfian distribution, with the skew of the data (z) varied over 0, 1, 2 and 3. The sizes of the data sets vary from 2 to 8 cells, with 3 dimensions and 4 granularity levels in one dimension.

Comparison on Different Zipf Parameter. We apply FMC to the TPC-H benchmark and to the synthetic data sets with parameter z = 0, 1, 2 and 3. We randomly select % of the cells in two cuboids as the forbidden set, and compare the additional cells hidden by FMC and by SeCube (L. Wang et al. 2004). Figure 6(a) shows the results.
When z=0, the data is uniformly distributed, and fewer additional cells need to be hidden than in the skewed case. Because some values of the dimension appear less often in the skewed data set, these sparse data are the main cause of inference.

(Footnote: the generator is obtained via ftp.research.microsoft.com/users/viveknar/tpcdskew)

Fig. 6. Size of additional protected cells / size of cube, for SeCube and FMC. (a) Compare on different zipf factors (x-axis: Z factor of zipfian distribution, and TPCH). (b) Compare on different forbidden set size (x-axis: size of forbidden set (%)).

We also evaluate the effectiveness of the two optimizing strategies. Figure 7(a), which plots the size of the candidate range, shows that at most 5% of the cube needs to be checked for inference. Figure 7(b) shows the number of single-son inference cases. Since these account for a significant part of all inference cases, eliminating the single-son inference first contributes greatly to the approach.

Fig. 7. Experimental result of two optimizing strategies. (a) Size of candidate range / size of cube. (b) Single-son inference / all inference cases. (x-axis in both panels: Z factor of zipfian distribution, and TPCH.)

Fig. 8. Size of candidate range / size of cube (x-axis: size of forbidden set (%)).

Fig. 9. Runtime of FMC (x-axis: size of forbidden set (%); y-axis: runtime in milliseconds).

Comparison on Varied Forbidden Set. We set the zipf parameter to z= and change the size of the forbidden set. Figure 6(b) shows the number of additional cells hidden by SeCube [6] and FMC; FMC hides fewer cells than SeCube in all cases. Figure 8 shows the candidate range for different forbidden set sizes. The size of the candidate range stays below 4% in all cases, which means that we only need to check 4% of the whole cube for inference. We also tested the runtime of FMC for different sizes of the forbidden set (Figure 9).

7 Conclusions

In this paper, we present an effective and efficient algorithm to address the privacy preserving OLAP problem. The main idea is to hide part of the data causing the inference, so that the sensitive information can no longer be computed. We guarantee that all the information we hide is necessary, and thus as much information as possible is provided to users while the sensitive data is protected. All the work is done before users interact with the system, so it does not affect the online performance of the OLAP system. Our algorithm is partially based on linear system theory, so its correctness can be strictly proved. Experimental results also demonstrate the effectiveness of the algorithm.

Future work includes applying the method to other aggregation functions and improving the efficiency of the algorithm. We also plan to extend the work to solve the inference problem caused by involving two aggregation functions in one cube.

References

1. L. Wang, D. Wijesekera: Cardinality-based Inference Control in Sum-only Data Cubes. Proc. of the 7th European Symp. on Research in Computer Security.
2. F. Y. Chin, G. Ozsoyoglu: Auditing and inference control in statistical databases. IEEE Trans. on Software Eng. (Apr. 1982)
3. L. H. Cox: Suppression methodology and statistical disclosure control. Journal of the American Statistical Association, 75.
4. D. E. Denning: Secure statistical databases under random sample queries. ACM Trans. on Database Syst. Vol. 5(3) (Sept. 1980)
5. L. Wang, Y. Li, D. Wijesekera, S. Jajodia: Precisely Answering Multi-dimensional Range Queries without Privacy Breaches. ESORICS 2003.
6. L. Wang, S. Jajodia, D. Wijesekera: Securing OLAP data cubes against privacy breaches. Proc. IEEE Symp. on Security and Privacy, 2004.
7. K. Nicholson: Elementary Linear Algebra. Second Edition, McGraw Hill, 2004.
