A Post Randomization Framework for Privacy-Preserving Bayesian Network Parameter Learning


JIANJIE MA, K. SIVAKUMAR
School of Electrical Engineering and Computer Science, Washington State University
Pullman, WA 99164-2752
{jma, siva}@eecs.wsu.edu

Abstract: - The post randomization technique has been used successfully in statistical disclosure limitation. This paper explores its application to privacy-preserving data mining, taking privacy-preserving Bayesian network parameter learning as a specific example. We propose to use post randomization to randomize the privacy-sensitive variables when learning Bayesian network parameters from distributed heterogeneous databases. The only information required from the data set is a set of sufficient statistics for learning the Bayesian network parameters, and the proposed method estimates these sufficient statistics from the randomized data. We show both theoretically and experimentally that the method learns an accurate set of parameters, even under large levels of randomization. We also illustrate the trade-off between privacy and accuracy by simulations.

Key-Words: - Bayesian Network, Privacy-Preserving Data Mining, Distributed Heterogeneous Databases, Post Randomization

1. Introduction

Privacy-preserving data mining deals with the problem of building accurate data mining models over aggregate data while protecting privacy at the level of individual records. There are two main approaches to privacy-preserving data mining. One approach is to perturb or randomize the data before sending it to the data miner; the perturbed or randomized data are then used to learn or mine the models and patterns [1,2]. The other approach is to use secure multiparty computation (SMC) to enable two or more parties to build data models without any party learning anything about the other parties' data [4].

Privacy-preserving Bayesian network (BN) learning is a more recent topic. Wright and Yang [10] discuss privacy-preserving BN structure computation on distributed heterogeneous databases, while Meng et al. [8] have considered the privacy-sensitive BN parameter learning problem. The underlying method in both works is to convert the computations required for BN learning into a series of inner product computations and then to use a secure inner product computation method proposed elsewhere. The number of secure computation operations increases exponentially with the number of possible configurations of the problem variables. Current work on privacy-preserving BN learning focuses on multiparty models, which require every party to have some computational capability. In addition to this model, our work considers a model in which a data miner performs all the computations for the participating parties. The SMC method has the following two drawbacks: (1) it assumes a semi-honest model, which is often unrealistic in the real world; (2) it requires large volumes of synchronized computations among the participating parties. Most of the synchronized computations are overhead due to the privacy requirement. Post randomization overcomes these drawbacks of the SMC method through a trade-off between accuracy and privacy. A malicious party who does not obey the protocol in the SMC method can easily obtain some private information of other parties, which he would not be able to do if post randomization were applied to the individual data records.

2. Problem Formulation

Privacy-preserving BN learning involves distributed databases, where the database is owned by several parties. If the database is homogeneously distributed, privacy-preserving BN learning is relatively easy, since every party can send the data miner (or the other parties) a set of sufficient statistics computed from its part of the database; the privacy of individual records is not breached by sharing sufficient statistics. The problem of privacy-preserving BN learning from a heterogeneous database is that several parties, each owning a vertical portion of the database, want to learn a global BN for their mutual benefit, but they are concerned about the privacy of their sensitive variables. In this paper, we consider the problem of BN parameter learning for the case of discrete variables. We consider the following two models.

Model I: There is no data miner; every party has to do some portion of the learning computations. Every party sends its randomized data to those parties that need the data.

Model II: There is a data miner who does all the computations for the participating parties. Every party simply sends all its randomized data to the data miner.

3. Privacy Analysis for Post Randomization

Consider a database $D$ with $n$ variables $\{X_1, \ldots, X_n\}$, where $X_i$ takes discrete values from the set $S_i$. Post randomization of variable $X_i$ is a (random) mapping $R_i : S_i \to S_i$, based on a set of transition probabilities $p_{lm} = P(X_i' = k_m \mid X_i = k_l)$, where $k_m, k_l \in S_i$ and $X_i'$ denotes the (randomized) variable corresponding to $X_i$. The transition probability $p_{lm}$ is the probability that a variable $X_i$ with original value $k_l$ is randomized to the value $k_m$. Post randomization is so named because the randomization happens after the data have been collected. Let $P_i = \{p_{lm}\}$ denote the $K_i \times K_i$ matrix that has $p_{lm}$ as its $(l, m)$-th entry, where $K_i$ is the cardinality of the set $S_i$. The condition that $P_i$ is nonsingular has to be imposed if we want to estimate the frequency distribution of a variable from the randomized variables. In the following, we describe some simple but effective post randomization schemes on which our experiments are based. If variable $X_i$ takes binary values, we can use the binary randomization shown in Fig. 1(a), with transition probabilities $p_1$ and $p_2$. If the variable is ternary, the ternary symmetric channel shown in Fig. 1(b), with parameter $p_3$, can be used.

[Fig. 1: Randomization schemes. (a) Binary channel with transition probabilities $p_1$ and $p_2$; (b) ternary symmetric channel with parameter $p_3$.]
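As a concrete illustration of these schemes, the following sketch applies a PRAM transition matrix row by row to one discrete column. It is our own minimal example (the function name `pram_column` and the NumPy dependency are not from the paper), with the binary scheme of Fig. 1(a) taken to be symmetric for simplicity:

```python
import numpy as np

def pram_column(values, P, rng):
    """Post-randomize one discrete column: an original value l is released
    as m with probability P[l, m]; each row of P must sum to 1."""
    return np.array([rng.choice(len(P), p=P[v]) for v in values])

# Binary symmetric scheme (Fig. 1(a) with p1 = p2 = p):
p = 0.25
P_bin = np.array([[1 - p, p],
                  [p, 1 - p]])

# Ternary symmetric channel (Fig. 1(b) with parameter p3):
p3 = 0.15
P_ter = np.full((3, 3), p3)
np.fill_diagonal(P_ter, 1 - 2 * p3)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)      # a toy binary column
x_released = pram_column(x, P_bin, rng)  # randomized column that is sent out
```

Because the randomization is applied per record after collection, a released value individually reveals little, while the known matrix $P_i$ lets the receiver de-bias aggregate counts (Section 5).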

We can apply the same randomization scheme independently to all the variables: a uniform randomization of the data set. Alternatively, we can use a non-uniform randomization, where different post randomization schemes are applied to different variables independently. Non-uniform randomization is effective when different variables have different sensitivity levels. For example, we can choose different randomization parameters $p_1$ and $p_2$ for different binary variables if the privacy requirements of the two variables differ. Non-uniform randomization includes, as a special case, applying no randomization to variables that have no privacy requirement.

From the above, we can see that if variable $X_i$ takes $K_i$ values (categories), the dimension of $P_i$ will be $K_i \times K_i$. With larger $K_i$, more randomization is in general introduced into the variable. This is good from a privacy point of view. However, the variances of the estimators of the frequency counts will also be larger for the same sample size. One solution to this problem is to partition the $K_i$ categories into several groups such that a value in one group can only be randomized to a value in the same group. In this case, the matrix $P_i$ becomes a block diagonal matrix. How many groups the $K_i$ categories should be partitioned into is a matter of design choice.

Post randomization can also be applied to several variables simultaneously. For example, the variables $X_i$ and $X_j$ can be randomized simultaneously according to joint transition probabilities $P(X_i' = k_{l_1}, X_j' = k_{l_2} \mid X_i = k_{m_1}, X_j = k_{m_2})$. Randomizing variables simultaneously can avoid possible inconsistencies in the database caused by randomization.

We consider the notion of privacy introduced by Evfimievski et al. [5] in terms of an amplification factor $\gamma$. The amplification $\gamma$ in [5] is proposed in a framework where every data record must be randomized by an operator that is at most $\gamma$-amplifying before the data are sent to the data miner, in order to limit privacy breaches. In this paper, however, we use the amplification $\gamma$ purely as a worst-case quantification of privacy for a designed post randomization scheme. It is proved in [5] that if the randomization operator is at most $\gamma$-amplifying, revealing $X_i' = k$ will cause neither an upward $\rho_1$-to-$\rho_2$ privacy breach nor a downward $\rho_2$-to-$\rho_1$ privacy breach if

$$\frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)} > \gamma.$$

Clearly, the smaller the value of $\gamma$, the better the worst-case privacy; ideally we would like to have $\gamma = 1$. The at-most-$\gamma$ amplification provides a worst-case quantification of privacy. However, it does not provide any information about privacy in general. Besides $\gamma$, we use

$$K = \min_{k'} \#\{k : P(X' = k' \mid X = k) > 0\},$$

the minimum number of categories that can be randomized to a given category $k'$ under the designed post randomization scheme, as another quantification of privacy. This $K$ indicates the privacy preserved in general. It is similar to the $K$ defined in K-anonymity [9], but in a probabilistic sense. If we group the categories of a variable into several groups, then $K$ becomes smaller in general.
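Both quantities can be read directly off the transition matrix. The sketch below is our own illustration (the helper names are hypothetical): it computes the worst-case amplification $\gamma$ as the largest ratio, over released categories, between the most and least likely originals, and $K$ as the smallest number of originals with nonzero mass on a released category:

```python
import numpy as np

def amplification(P, eps=1e-12):
    """Worst-case amplification factor: max over released categories k' of
    max_l P[l, k'] / min_l P[l, k'] (rows l index the original values).
    If some original value can never reach k', gamma is unbounded."""
    gamma = 1.0
    for col in P.T:                      # one column per released category
        if (col <= eps).any():
            return np.inf
        gamma = max(gamma, col.max() / col.min())
    return gamma

def k_general(P, eps=1e-12):
    """min over released categories k' of #{k : P(X'=k' | X=k) > 0}."""
    return int((P > eps).sum(axis=0).min())

p = 0.25
P_bin = np.array([[1 - p, p], [p, 1 - p]])
print(amplification(P_bin))  # (1 - p) / p = 3.0 for the binary symmetric scheme
print(k_general(P_bin))      # 2: both originals can yield either released value
```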

4. Post Randomization Framework for Parameter Learning

For parameter learning, we assume that the structure $G$ is fixed and known to every participating party. For Model I, we use the definitions of cross variables and cross parents given in [3]. $N_{ijk}$ is the number of records such that $X_i$ is in its $k$-th category while its parents are in their $j$-th configuration.

Steps of learning parameters for Model I, for each party $a_i$:

(1) Randomize the cross parents at the same site according to their respective privacy requirements, using the post randomization described in Section 3. Randomizations are done independently for each (combined) variable and each record.

(2) Send the randomized cross parents of party $a_i$ intended for party $a_j$ to party $a_j$, together with the probability transition matrices used.

(3) Learn the parameters for the local variables of party $a_i$. This step does not involve randomized data.

(4) Estimate the sufficient statistics $N_{ijk}$'s for each cross variable at the same site, using local data and the randomized parent data from the other parties.

(5) Compute the parameters for the cross variables using the estimated sufficient statistics $\hat{N}_{ijk}$'s.

(6) Share the parameters with all other parties.

Local variables at each site are not randomized for local calculations.

Steps of learning parameters for Model II, for each party $a_i$:

(1) Randomize all sensitive variables according to their respective privacy requirements, using the post randomization described in Section 3. Randomizations are done independently for each (combined) variable and each record.

(2) Send the randomized data and the corresponding probability transition matrices to the data miner.

For the data miner:

(1) Estimate the sufficient statistics $N_{ijk}$ for each node $X_i$ using the randomized data from the participating parties.

(2) Estimate the parameters using the estimated sufficient statistics $\hat{N}_{ijk}$.

(3) Broadcast the parameters to all parties.

The details of the estimation of the sufficient statistics and parameters (Steps 4 and 5 for Model I; Steps 1 and 2 for the data miner in Model II) from the randomized data are described in Section 5.
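The party side of Model II then amounts to a few lines. The sketch below is our own schematic illustration (names such as `party_release` are hypothetical, not from the paper): each sensitive column is randomized independently with its own transition matrix, and the released data travel together with those matrices, which the miner needs for the estimation of Section 5:

```python
import numpy as np

def pram_column(values, P, rng):
    # As in the Section 3 sketch: row P[l] is the distribution of the
    # released value when the original value is l.
    return np.array([rng.choice(len(P), p=P[v]) for v in values])

def party_release(data, schemes, rng):
    """Model II, party side. `data` maps variable name -> integer-coded
    column; `schemes` maps variable name -> transition matrix, with
    non-sensitive variables omitted. Returns what is sent to the miner."""
    released, matrices = {}, {}
    for var, col in data.items():
        if var in schemes:
            released[var] = pram_column(col, schemes[var], rng)
            matrices[var] = schemes[var]   # miner de-biases counts with P
        else:
            released[var] = col            # no privacy requirement
    return released, matrices
```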

5. Estimation of Sufficient Statistics and Parameters from Randomized Data

The problem of privacy-preserving BN parameter learning can be decomposed into a series of estimations of the $N_{ijk}$'s for each node $X_i$, given a fixed structure $G$, from the randomized data $D'$. Consider the following general case: variable $X_i$ with cardinality $K_i$ has $Q$ parent nodes $Pa_i(1), \ldots, Pa_i(Q)$, and the cardinality of $Pa_i(q)$ is $K_{Pa_i(q)}$. These variables can be arbitrarily vertically partitioned across the parties in both models. The randomization of each (combined) variable can also be done by grouping the categories of the variable into groups. Owing to simultaneous randomization, we have the following cases for estimating the $N_{ijk}$'s from the randomized data $D'$:

(a) $X_i$ and its parents are all randomized independently of each other.
(b) Some parents of $X_i$ are randomized simultaneously.
(c) $X_i$ is randomized simultaneously with some of its parents.
(d) $X_i$ is randomized simultaneously with non-parent variables.

For (b) and (c), we can treat the simultaneously randomized variables as combined variables when estimating the sufficient statistics. For example, if variable $X_i$ is randomized simultaneously with one of its parents $Pa_i(1)$, then $N_{ijk}$ is equal to the number of records such that $(X_i; Pa_i(1)) = (k, j_1)$, $Pa_i(2) = j_2, \ldots, Pa_i(Q) = j_Q$, where $(X_i; Pa_i(1))$ is a combined variable. Thus, we can estimate the $N_{ijk}$'s from the randomized data by considering $(X_i; Pa_i(1))$ as a single variable with cardinality $K_i K_{Pa_i(1)}$. For case (d), since the current $N_{ijk}$ does not involve the variable randomized simultaneously with $X_i$, the data miner can obtain the marginal transition probability matrix from the given transition matrix of the combined variable. From the above arguments, we conclude that cases (b), (c), and (d) can effectively be reduced to case (a). Hence, without loss of generality, we discuss case (a) only.

We denote by $Pa(X_i)$ the compound variable comprising all the parents of $X_i$; hence $Pa(X_i)$ takes $J_i = \prod_{q=1}^{Q} K_{Pa_i(q)}$ different values. Let $N_{ij} = \sum_{k=1}^{K_i} N_{ijk}$, and let $N_i$ be the $J_i K_i$-dimensional vector of the $N_{ijk}$ values, that is, $N_i = (N_{i11}, N_{i12}, \ldots, N_{i1K_i}, N_{i21}, \ldots, N_{iJ_iK_i})^t$, where the superscript $t$ denotes transpose; $N_i(l)$ is the $l$-th element of $N_i$. $N'_{ijk}$, $N'_{ij}$, and $N'_i$ are defined similarly to $N_{ijk}$, $N_{ij}$, and $N_i$, but for the randomized data $D'$. $\hat{N}_{ijk}$, $\hat{N}_{ij}$, and $\hat{N}_i$ are estimators of $N_{ijk}$, $N_{ij}$, and $N_i$, respectively. Given the training data $D$ with $N$ records of the variable $X_i$ and its $Q$ parents in the above general case, post-randomized with transition probability matrices $P^{Pa_i(1)}, P^{Pa_i(2)}, \ldots, P^{Pa_i(Q)}, P^i$, respectively, we have the following theorem.

Theorem 1: $E[N'_i \mid D] = P^t N_i$, where $P = P^{Pa_i(1)} \otimes P^{Pa_i(2)} \otimes \cdots \otimes P^{Pa_i(Q)} \otimes P^i$ and $\otimes$ denotes the Kronecker matrix product. Moreover, $\mathrm{Cov}[N'_i \mid D] = \sum_{l=1}^{J_i K_i} N_i(l)\, V_l$, where $V_l$ is a $J_i K_i \times J_i K_i$ covariance matrix whose $(l_1, l_2)$-th element is

$$V_l(l_1, l_2) = \begin{cases} P(l, l_1)\,(1 - P(l, l_1)) & \text{if } l_1 = l_2, \\ -P(l, l_1)\,P(l, l_2) & \text{if } l_1 \neq l_2. \end{cases}$$

Proofs are omitted here due to page limitations; interested readers can refer to a longer version of this paper for details [7].

The following theorem establishes the bias and variance of the estimator $\hat{N}_i = (P^t)^{-1} N'_i$. Its proof is straightforward and is omitted.

Theorem 2: $\hat{N}_i = (P^t)^{-1} N'_i$ is an unbiased estimator of $N_i$, and $\mathrm{Cov}[\hat{N}_i \mid D] = (P^t)^{-1}\, \mathrm{Cov}[N'_i \mid D]\, P^{-1}$, where $P$ and $\mathrm{Cov}[N'_i \mid D]$ are as defined in Theorem 1.

We can use the estimated sufficient statistics to obtain the ML estimate of the parameters as

$$\hat{\theta}_{ijk} = \frac{\hat{N}_{ijk}}{\hat{N}_{ij}} = \frac{\hat{N}_{ijk}}{\sum_{k=1}^{K_i} \hat{N}_{ijk}},$$

and the MAP estimate of the parameters as

$$\hat{\theta}_{ijk} = \frac{\alpha_{ijk} + \hat{N}_{ijk}}{\alpha_{ij} + \hat{N}_{ij}},$$

where the prior distribution of $\theta_{ij}$ is assumed to be Dirichlet with parameters $\{\alpha_{ij1}, \alpha_{ij2}, \ldots, \alpha_{ijK_i}\}$ and $\alpha_{ij} = \sum_k \alpha_{ijk}$. The distribution of the estimator $\hat{\theta}_{ijk}$ is discussed in [7]. One important result from [7] is that the distribution of $\hat{\theta}_{ijk}$ can be approximated by a normal distribution with mean $\theta_{ijk}$ and a variance of the order of $1/N$, where $N$ is the training sample size.
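Operationally, Theorem 2 means the data miner solves one linear system per node. Below is a minimal sketch of the miner's Steps 1 and 2 (our own code, under the assumption that the count vector is ordered as in the definition of $N_i$ above, with the parent configuration varying slowest and the child category fastest):

```python
import numpy as np
from functools import reduce

def estimate_counts(N_prime, parent_Ps, child_P):
    """Unbiased estimator of Theorem 2: N_hat = (P^t)^{-1} N'_i, with
    P = P^{Pa(1)} kron ... kron P^{Pa(Q)} kron P^i."""
    P = reduce(np.kron, parent_Ps + [child_P])
    return np.linalg.solve(P.T, N_prime)   # solves P^t x = N'_i

def ml_parameters(N_hat, K_child):
    """ML estimate theta_hat_ijk = N_hat_ijk / N_hat_ij; one row of
    conditional probabilities per parent configuration j."""
    cells = N_hat.reshape(-1, K_child)     # J_i rows, K_i columns
    return cells / cells.sum(axis=1, keepdims=True)
```

In practice $\hat{N}_i$ can contain small negative entries at high randomization levels and small sample sizes; clipping them to zero before normalizing is one common remedy (our suggestion, not discussed in the paper).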

6. Experimental Results

6.1 Non-uniform Randomization

In this experiment, we use the Bayesian network shown in Fig. 2, where the variables are distributed over three sites. All variables are binary except variables L and B, which are ternary. The conditional probabilities of the different nodes are also shown. Samples were generated from this Bayesian network to form the dataset $D$. These data were then randomized according to the scheme described in Table 1; variables T, S, and G were considered not sensitive and hence were not randomized. The corresponding at-most-$\gamma$ amplification is also shown in Table 1. $K = 2$ for the binary randomizations, whereas $K = 3$ for the ternary randomizations.

[Fig. 2: The Bayesian network for experiment 6.1. The variables A, T, S, L, X, B, E, C, D, F, and G are distributed over three sites, and the conditional probability table of each node is shown next to it.]

Variables | Scheme | $\gamma$
A, D | binary symmetric, $p_1 = p_2 = 0.25$ | 3
L, B | ternary symmetric, $p_3 = 0.15$ | 4.67
E | binary symmetric, $p_1 = p_2 = 0.2$ | 4
X | binary symmetric, $p_1 = p_2 = 0.2$ | 4
C, F | binary, asymmetric ($p_1 \neq p_2$) | 9

Table 1: Randomization performed

Table 2 shows part of the parameters learnt from the randomized data using the algorithm described in Section 4 for Model II. Less randomization occurs in Model I, so the results for Model I are better than those for Model II. The remaining part of each distribution can be calculated as one minus the given part. All values in the table are averages across independent runs, with the corresponding standard deviation indicated in parentheses. It is clear from the table that the proposed algorithms can accurately learn the BN parameters in both scenarios, even for moderate levels of randomization.

[Table 2: Mean and standard deviation ($\times 10^{-2}$) across independent runs of the parameters learnt from the randomized data.]

6.2 Trade-off between Privacy and Accuracy

In this experiment, we use the Bayesian network shown in Fig. 3, where the variables are distributed over two sites; all variables are binary. We generated samples from this Bayesian network. To examine the trade-off between privacy and accuracy, we randomized the samples using binary symmetric randomization at different levels $p = p_1 = p_2$ and learnt the parameters from the randomized samples using the method discussed in Section 4. As in the previous experiment, we present only the results for Model II. In this experiment, every variable is randomized using the symmetric binary randomization with the same randomization level $p$.

[Fig. 3: The Bayesian network for experiment 6.2. The binary variables A, B, and C reside at site 1, and D, E, F, and G at site 2; the conditional probability table of each node is shown next to it.]

Since the parameters associated with a node are nothing but the conditional probability of the node given its parents, the accuracy of the parameters associated with a node can be measured by the conditional Kullback-Leibler (CKL) distance between the parameters learnt from the randomized data and those learnt from the non-randomized data. The CKL distance for node $X_i$ in our case is

$$D(X_i, p) = \sum_{j=1}^{J_i} P(pa_i = j)\, D_{KL}\!\left(P^{(0)}(X_i \mid pa_i = j),\; P^{(p)}(X_i \mid pa_i = j)\right),$$

where $P^{(0)}(X_i \mid pa_i = j)$ and $P^{(p)}(X_i \mid pa_i = j)$ are the parameters learnt from the non-randomized data and from the randomized data at randomization level $p$, respectively, and $D_{KL}$ denotes the ordinary KL distance between two distributions.
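This measure is straightforward to compute from the two learnt conditional probability tables. A minimal sketch (our own code, not from the paper; `theta_0`, `theta_p`, and `pa_probs` are hypothetical names):

```python
import numpy as np

def ckl_distance(theta_0, theta_p, pa_probs, eps=1e-12):
    """D(X_i, p) = sum_j P(pa_i = j) * KL(P^(0)(X_i|pa_i=j) || P^(p)(X_i|pa_i=j)).
    theta_0, theta_p: (J_i, K_i) conditional probability tables learnt from
    non-randomized and randomized data; pa_probs: length-J_i vector of
    parent-configuration probabilities."""
    kl_rows = np.sum(theta_0 * np.log((theta_0 + eps) / (theta_p + eps)),
                     axis=1)                # one KL term per configuration j
    return float(pa_probs @ kl_rows)
```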

We present the distances associated with node C, node D, and node F in Fig. 4; these are typical nodes of the given Bayesian network. The averages are taken across independent runs, and the average plus one standard deviation is also depicted (dotted line). From Fig. 4 we can clearly see the trade-off between accuracy and privacy. Since we use the symmetric binary randomization, more privacy is preserved with larger $p$ when $p < 0.5$. With the given number of training samples, the method still achieves good accuracy at $p = 0.3$.

[Fig. 4: CKL distance vs. randomization level $p$ for nodes C, D, and F (average, and average plus one standard deviation).]

6.3 Training Sample Size

As pointed out in Section 5, the variance of the estimator of a parameter $\theta_{ijk}$ is of the order of one over the sample size $N$. Thus, under the same accuracy requirement, more privacy can be preserved if there are more training samples. This experiment illustrates the effect of the training sample size. We generated training samples using the Bayesian network of Fig. 3, and the proposed method of Section 4 was used to learn the parameters from data randomized at levels $p = 0.1$, $p = 0.2$, $p = 0.3$, and $p = 0.4$, for a range of training sample sizes indexed by $k$ ($1 \le k \le 8$). The experimental results are shown in Fig. 5. As in experiment 6.2, the average is taken across independent runs, and the average plus one standard deviation is also shown. The results for randomization level $p = 0.4$ are shown separately; conditional distances outside the scale of the vertical axis are not shown in the figure. From this experiment we can clearly see that the training sample size plays a key role in the trade-off between accuracy and privacy: when the training sample size is very large, we can have both good privacy and good accuracy.

7. Conclusion

We have proposed a post randomization technique for learning the parameters of a Bayesian network from distributed heterogeneous data. Our method estimates the sufficient statistics from the randomized data, and these estimates are subsequently used to learn the parameters. Our experiments show that post randomization is an efficient, flexible, and easy-to-use method for learning Bayesian network parameters from privacy-sensitive data. Currently, we are exploring the extension of post randomization techniques to learning BN structure from sensitive data. The idea of estimating sufficient statistics from randomized data can also be used to learn other data mining models, such as decision trees. We plan to report these extensions and applications in a future publication.

[Fig. 5: CKL distance vs. training sample size for nodes C, D, and F at randomization levels $p = 0.1$, $0.2$, $0.3$, and $0.4$ (average, and average plus one standard deviation).]

References:
[1] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the ACM SIGMOD/PODS Conference, 2001.
[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 439-450, May 2000.
[3] R. Chen, K. Sivakumar, and H. Kargupta. Collective Mining of Bayesian Networks from Distributed Heterogeneous Data. Knowledge and Information Systems Journal, vol. 6, 2004.
[4] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu. Tools for Privacy Preserving Distributed Data Mining. ACM SIGKDD Explorations, 4(2):28-34, 2003.
[5] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the ACM SIGMOD/PODS Conference, San Diego, CA, June 2003.
[6] J. M. Gouweleeuw, P. Kooiman, L. C. R. J. Willenborg, and P.-P. de Wolf. Post Randomisation for Statistical Disclosure Control: Theory and Implementation. Journal of Official Statistics, 14:463-478, 1998.
[7] J. Ma and K. Sivakumar. Privacy-Preserving Bayesian Network Learning Using Post Randomization. In preparation, 2005.
[8] D. Meng, K. Sivakumar, and H. Kargupta. Privacy-Sensitive Bayesian Network Parameter Learning. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM), Brighton, UK, November 2004.
[9] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.
[10] R. Wright and Z. Yang. Privacy Preserving Bayesian Network Structure Computation on Distributed Heterogeneous Data. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.