Using The Central Limit Theorem for Belief Network Learning


Ian Davidson, Minoo Aminian
Computer Science Dept, SUNY Albany, Albany, NY, USA

Abstract. Learning the parameters (conditional and marginal probabilities) from a data set is a common method of building a belief network. Consider the situation where we have many complete (no missing values), same-sized data sets randomly selected from the population. For each data set we learn the network parameters using only that data set. In such a situation there will be no uncertainty (posterior distribution) over the parameters in each graph; however, how will the learnt parameters differ from data set to data set? In this paper we show that the parameter estimates across the data sets converge to a Gaussian distribution with a mean equal to the population (true) parameters. This result is obtained by a straightforward application of the central limit theorem to belief networks. We empirically verify the central tendency of the learnt parameters and show that the parameters' variance can be accurately estimated by Efron's bootstrap sampling approach. Learning multiple networks from bootstrap samples allows the calculation of each parameter's expected value (as per standard belief networks) and also its second moment, the variance. Having the expectation and variance of each parameter has many potential applications. In this paper we discuss initial attempts to explore their use to obtain confidence intervals over belief network inferences and the use of Chebychev's bound to determine if it is worth gathering more data for learning.

Introduction

Bayesian belief networks are powerful inference tools that have found many practical applications []. A belief network is a model of a real world situation with each node (N_1 … N_m) in the network/graph (G) representing an event. An edge/connection (e_1 … e_l) between nodes is a directional link showing that an event influences or causes the value of another.
As each node value can be modeled as a random variable (X_1 … X_m), the graph is effectively a compact representation of the joint probability distribution over all combinations of node values. Figure 1 shows a belief network modeling the situation that relates smoking, cancer and other relevant information.

Building belief networks is a challenging task that can be accomplished by knowledge engineering or by automatically producing the model from data, a process commonly known as learning []. Learning graphical models from a training data set is a popular approach where domain expertise is lacking. The training data contains the occurrence (or otherwise) of the events for many situations, with each situation termed a training example, instance or observation. For example, the parameters for the Cancer network shown in Figure 1 could be learnt from many patient records that contained the occurrence or not of the events.

The four commonly occurring situations of learning belief networks [] are different combinations of a) learning with complete (or incomplete) data and b) learning with known (or unknown) graph structure. Learning with complete data indicates that the training data contains no missing values, while incomplete data indicates that for some training examples, pieces of information about the world situation were unknown. With known graph structure the learning process attempts to estimate the parameters of the random variables, while with unknown structure the graph structure is also learnt. There exists a variety of learning algorithms for each of these situations [][].

In this paper we shall focus on the situation where the data is complete and the graph structure is specified. In this simplest situation there is no posterior distribution over the model space, as there is no uncertainty due to latent variables or missing values. This allows us to focus on the uncertainty due to variations in the data. In future work we hope to generalize our findings to other learning situations. Without loss of generality and for clarity we shall focus on Boolean networks.
Learning the parameters of the graph from the training data D of s records involves counting the occurrences of each node value (for parentless nodes) across all records to calculate the marginal probabilities, or of each combination of values (for nodes with parents), and normalizing to derive the conditional probabilities. Now consider having many randomly drawn data sets (D_1 … D_t), all of size s, from which we learn models θ_1 … θ_t. Each model places a probability distribution over the event space, which is all combinations of node values. How will these parameters vary from model to model? We shall answer this question by considering a probability distribution over the parameters for each node. We shall call these random variables {Q_1 … Q_m}. Note that Q_i may be a block of random variables, as the number of parameters per node can vary depending on its number of parents. In a Boolean network, for a parentless node, Q_i is a single random variable whose value represents the node parameter P(X_i = TRUE). For nodes that have parents in a Boolean network, Q_i consists of 2^(#Parents) random variables representing the parameters P(X_i = TRUE | Condition_1) … P(X_i = TRUE | Condition_{2^(#Parents)}), with the different conditions being the various combinations of TRUE and FALSE values of the parents.

What can we say of these continuous random variables that represent the parameter estimates learnt from different random samples? In this paper we show that these continuous random variables form a Gaussian distribution that centers on the population values (the parameters that would be learnt from the entire population of data) and whose variance is a function of the sample size. We show that Q_i ~ Gaussian(p_i*, p_i*(1 − p_i*)/(ks)), where k is a constant, p_i* is the relevant parameter of the generating mechanism that produced the data and s is the sample size. The standard deviation of this random variable approaches zero and the sample mean converges asymptotically to the population value as s increases.

We can now make two estimates for each parameter in the network: its expected value and its variance. In most learning situations the actual probability distribution over the learnt parameters due to uncertainty/variability in the observed data set is unknown. Sometimes generalities are known, such as that decision trees are unstable learners or that iterative algorithms that minimize the distortion (vector quantization error) are sensitive to initial parameters. In this paper we make use of the probability distribution over the learnt parameters for two applications. These should be treated as initial attempts to make use of the mean and variance of each parameter, and future work will refine these applications. Firstly, we use the variance to determine if gathering more data will necessarily change the mean value of the parameters by at least some user-specified value.
This has applications when collecting data is time consuming or expensive. We then illustrate how to create confidence intervals associated with each inference made from the network.

We begin this paper by describing the learning of belief networks; readers familiar with this topic may skip this section. The applicability of the central limit theorem to belief network learning with complete data is described next. We then empirically verify the central tendency of the distribution of the learnt parameters by sampling the data used for learning, without replacement, from a large collection of data (generated by a Gibbs sampler). Having such a large pool of data is an unusual luxury, and we next illustrate approximating the variance associated with the learnt parameters using Efron's bootstrap sampling (sampling with replacement) [5]. We next discuss how to use the estimated variance for our two proposed applications. Finally we conclude our work and describe future directions.

Learning Belief Networks

The general situation of learning a belief network can be considered as consisting of estimating all the parameters (marginal and conditional probabilities), θ, for the nodes and discovering the structure of the directed acyclic graph, G. Let θ_i = {q_i1 … q_im} be the parameters of the m node graph learnt from data set i. Note that the corresponding upper-case notation indicates the random variable for these parameters. As the network is Boolean, we only need the probability of the node being one value given its conditions (if any). In later sections we shall refer to graph and network together as the model of the data, H = {θ, G}.

This paper focuses on learning belief networks in the presence of complete data and a given graph structure. From a data set D_i = {d_i1, …, d_is}, d_il = {a_il1 … a_ilm}, of s records, we find either the marginal or conditional probabilities for each node in the network after applying a Laplace correction. The term a_ilj refers to the j th node's value of the l th instance within the i th data set.
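The counting estimate with a Laplace correction described here can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the representation of a record as a dict from node name to a Boolean, and the add-one form (count + 1) / (total + 2) for a Boolean node are choices made for this sketch, not details taken from the paper.

```python
def learn_parentless(records, node):
    """Estimate P(node = True) for a parentless Boolean node.

    Assumes the usual add-one (Laplace) correction for a two-valued
    variable: (count of True + 1) / (number of records + 2).
    """
    true_count = sum(1 for record in records if record[node])
    return (true_count + 1) / (len(records) + 2)


def learn_single_parent(records, node, parent):
    """Estimate P(node = True | parent = v) for v in (True, False).

    Only records whose parent value matches the condition are counted,
    and the same add-one correction is applied to each condition.
    """
    params = {}
    for parent_value in (True, False):
        matching = [r for r in records if r[parent] == parent_value]
        true_count = sum(1 for r in matching if r[node])
        params[parent_value] = (true_count + 1) / (len(matching) + 2)
    return params


# Hypothetical toy data: four records over two nodes, Age -> Exposure.
toy_records = [
    {"Age": True, "Exposure": True},
    {"Age": True, "Exposure": False},
    {"Age": False, "Exposure": False},
    {"Age": True, "Exposure": True},
]
p_age = learn_parentless(toy_records, "Age")
p_exposure = learn_single_parent(toy_records, "Exposure", "Age")
```

With the correction, a condition that never shows the node as TRUE still receives a non-zero estimate, which keeps later products of parameters away from zero.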
For complete data, known graph structure and a Boolean network, the calculation for node j is simply as follows, where [·] equals 1 when the condition inside holds and 0 otherwise.

If N_j has no parents:

$$q_{ij} = \frac{1 + \sum_{l=1}^{s} [a_{ilj} = T]}{s + 2}$$

If N_j has parents, Parents(N_j) = N_k:

$$q_{ijT} = \frac{1 + \sum_{l=1}^{s} [a_{ilj} = T \text{ and } a_{ilk} = T]}{2 + \sum_{l=1}^{s} [a_{ilk} = T]}$$

$$q_{ijF} = \frac{1 + \sum_{l=1}^{s} [a_{ilj} = T \text{ and } a_{ilk} = F]}{2 + \sum_{l=1}^{s} [a_{ilk} = F]}$$

Note that the subscript T or F refers to the value of the node's parent; the parameter q always refers to the probability of the node taking on the value T. For example, q_ijF refers to the probability learnt from the i th data set that the j th node will be T given its parent is F. If a node has more than one parent then the expression for the parameter calculations involves similar calculations over all combinations of the parent node values.

Figure 1. The Cancer Belief Network (nodes: Age, Gender, Exposure, Smoking, Cancer, S. Calcium, Tumor).

Figure 2. The Asia Belief Network (nodes: Visit, Smoking, TB, Cancer, Bronchitis, TB or Cancer, XRay Result, Dyspnea).

A Central Limit Theorem for Belief Networks

The Asymptotic Distribution of Learnt Parameters for Belief Networks

Consider as before a number of random samples of training data (D_1 … D_t), each of size s. We divide our graph of nodes into two subsets: those with no ancestors (Parentless) and those that have ancestors and are hence children (Children). We initially consider binary belief networks, whose nodes can only take on two values, T (TRUE) and F (FALSE), and children that have only one parent, but our results generalize to multi-parent nodes. For each training data set (D_k) we learn a model θ_k as described earlier. Under quite general conditions (finite variance) the parameters of the nodes that belong to Parentless will be Gaussian distributed due to the central limit theorem [4]. We can formally show this by making use of a common result from the central limit theorem, the Gaussian approximation to the binomial distribution [4]. A line of argument similar to the one that follows could have been established to show the central tendency around each and every joint probability value P(N_1 … N_m), but since the belief network is a compact representation of the joint distribution, we show the central tendency around the parameters of the network.

Situation #1: N_j ∈ Parentless

Let p_j be the proportion of times in the entire population (all possible data) that the j th node (a parentless node) is TRUE. If our sampling process (creating the training set) is random, then when we add/sample a record, the chance that the j th node value is TRUE can be modeled as the result of a Bernoulli trial (a Bernoulli trial can only have one of two outcomes).
As the random sampling process generates independent and identically distributed (IID) data, the probability that n of the s records in the sample will have TRUE for the j th node is given by a binomial distribution:

$$P(s, n, p_j) = \frac{s!}{n!\,(s-n)!}\; p_j^{\,n}\,(1 - p_j)^{\,s-n}$$

Applying the Taylor series expansion to this binomial expression, with the logarithm function for its dampening (smoothness) properties and an expansion point s·p_j, yields the famous Gaussian distribution expression:

$$P(s, n, p_j) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-(n - s\,p_j)^2/(2\sigma^2)}, \qquad \sigma^2 = s\,p_j\,(1 - p_j) \qquad (4)$$

If the sample size is s, then the number of occurrences of X_j = T in any randomly selected data set will be given by a Gaussian whose mean is s·p_j and whose variance is s·p_j(1 − p_j), as shown in equation (4). Dividing the count by s gives the parameter estimate, so Q_j ~ Gaussian(p_j, p_j(1 − p_j)/s) in the limit when t → ∞ and p_j > ε, where p_j is the proportion of times X_j = T in the population.

Situation #2: N_i ∈ Children and Parents(N_i) = N_j … N_k

A similar argument as above holds, except that there are now as many Bernoulli experiments as there are combinations of parent values. In total there will be k Bernoulli experiments. Therefore, Q_i ~ Gaussian(p_{i,j…k}, p_{i,j…k}(1 − p_{i,j…k})/(s c_k)). Note that c_k is just the proportion of times that the condition of interest (the parent value combination) occurred, so s c_k is the number of records in which that condition holds.
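The claim for a parentless node, Q_j ~ Gaussian(p_j, p_j(1 − p_j)/s), can be checked with a small simulation. The sketch below is an illustration only: the function names and the chosen values of p, s and t are assumptions made here, and the estimate is the plain proportion of TRUE values (no Laplace correction) so that the sample variance can be compared directly with p(1 − p)/s.

```python
import random
import statistics


def sample_estimates(p_true, s, t, seed=0):
    """Draw t data sets of s Bernoulli(p_true) records each and return
    the proportion of TRUE values learnt from every data set."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(t):
        true_count = sum(1 for _ in range(s) if rng.random() < p_true)
        estimates.append(true_count / s)
    return estimates


def summarize(estimates):
    """Return the mean and the sample variance of the learnt estimates."""
    return statistics.mean(estimates), statistics.variance(estimates)


# Hypothetical setting: p = 0.3, data sets of 200 records, 2000 data sets.
estimates = sample_estimates(p_true=0.3, s=200, t=2000, seed=1)
mean_estimate, variance_estimate = summarize(estimates)
predicted_variance = 0.3 * (1 - 0.3) / 200
```

Comparing `variance_estimate` with `predicted_variance`, and plotting a histogram of `estimates`, gives a quick visual check of both the centering on p and the 1/s shrinkage of the variance.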

A valid question to ask is how big p_j (or p_{i,j…k}) and s (or s c_k) should be to obtain reliable estimates. The general consensus in empirical statistics is that if t > 100, s > 20 and p_j (or p_{i,j…k}) > . the estimates made from the central limit theorem are reliable [4]. This is equivalent to the situation of learning from one hundred samples, where each parental condition occurs at least twenty times and no marginal or conditional probability estimate is less than .. While these conditions may seem prohibitive, we found in practice that for the standard belief networks tried (Asia, Cancer and Alarm) the distribution of parameters was typically Gaussian. Table 1 shows for the Cancer data set that, from 49 training data sets each of size training instances, the average of the learnt parameters passed a hypothesis test at the 95% confidence level when compared to the population (true) parameters. Similar results were obtained for the Cancer and Alarm data sets.

Mean Gender Hypothesised .
Mean Cancer Hypothesised .9 Difference between means -.6 95% CI -. to -.99
Mean Age 49. Hypothesised . Difference between means -. 95% CI -.7 to .
Mean S.C. 49. Hypothesised .7 Difference between means -. 95% CI -.7 to .
Mean Ex.to.Tox Hypothesised .49 Difference between means .4 95% CI -. to .6
Mean Lung Tmr Hypothesised .9 Difference between means .449 95% CI .4 to .6
Mean Smoking Hypothesised .4 Difference between means .64 95% CI -. to . Difference between means -. 95% CI -. to -.9

Table 1. For the Cancer data set, hypothesis testing of the mean of the learnt parameter and the population (true) value of the parameter. In all cases the difference between the true value and the mean of the learnt parameters lay in the 95% confidence interval. Only a subset of the hypothesis tests is presented for space reasons.

Using Bootstrapping to Estimate the Variance Associated With Parameters

Having the ability to generate random independent samples from the population is a luxury not always available. What is typically available is a single set of data D.
The bootstrap sampling approach [5] can be used to sample from D to approximate the situation of sampling from the population. Bootstrap sampling involves drawing r samples (B_1 … B_r), each of size equal to the original data set, by sampling with replacement from D. Efron showed that in the limit the variance amongst the bootstrap sample means will approach the variance amongst the independently drawn sample means. We use a visual representation of the difference between the probability distribution over the learnt parameter values (49 bootstrap samples each of size instances) and a Gaussian distribution with a mean and standard deviation calculated from the learnt parameter values, as shown in Figure 3. As expected, the most complicated node (Smoking) had the greatest deviation from the normal distribution.
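The bootstrap procedure just described, applied to a single parameter, can be sketched as follows. The function name, the encoding of a node's values as a list of 0/1 integers and the toy data are assumptions made for this illustration; for a full network the same resampled records would be used to re-learn every parameter.

```python
import random
import statistics


def bootstrap_variance(values, r, seed=0):
    """Estimate the mean and variance of a learnt proportion by
    drawing r bootstrap samples (with replacement, same size as the
    original data) and re-learning the proportion from each."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(r):
        resample = rng.choices(values, k=n)  # sampling with replacement
        estimates.append(sum(resample) / n)
    return statistics.mean(estimates), statistics.variance(estimates)


# Hypothetical single data set D: 400 records, 120 of them TRUE (1).
data = [1] * 120 + [0] * 280
boot_mean, boot_variance = bootstrap_variance(data, r=2000, seed=1)
```

For this toy data the observed proportion is 0.3, so the bootstrap variance should be close to 0.3 × 0.7 / 400, which is the value the central limit argument predicts for data sets of this size.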

[Figure 3: normal quantile plots, one panel per parameter: Gender, Cancer, Ex.to.Tox, S.C., Age, Lung Tmr, Smoking. Axis label: Normal Quantile.]

Figure 3. For the Cancer data set, the difference between the distribution over the learnt parameter values and a Gaussian distribution with the mean and the standard deviation of the learnt parameter values. The straight line indicates the Gaussian distribution. The x-axis is the parameter value, while the distance between the lines is the Kullback-Leibler distance between the distributions at that point.

Learning and Using the Variance of Learnt Parameters

We now describe an approach to learn the mean and variance of the belief network parameters. Firstly, the parameters of a belief network are learnt from the available data using the traditional learning approach described earlier. Secondly, bootstrap sampling is performed to estimate the variance associated with each learnt parameter. Estimating the variance involves taking a number of bootstrap samples of the available data, learning the parameters of the belief network from each and measuring the variance across these learnt parameters. Therefore, for each parameter in the belief network, we now have its expected value and variance. Estimates of the variance of the parameters could have been obtained from the expressions reported in Situation #1 and Situation #2 earlier. However, in practice we found that these were not as accurate as calculating variances from a large number (more than ) of bootstrap samples. This was because, with our limited training data, the learnt parameter was often different from the true value. In this paper we propose two uses of these additional parameters, which we describe next: determining if more data is needed for learning and coming up with confidence intervals over inferences.

Determining the Need for More Data

A central problem with learning is determining how much data is needed (how big s should be). The computational learning theory literature [6] addresses this problem, unfortunately predominantly for learning situations where no noise exists, that is, where the Bayes error of the best classifier would be 0. Though there has been much interesting work studying the behavior of iterative algorithms [7][], this work is typically for a fixed data set. The field of active learning attempts to determine what type of data to add to the initial training data set [9] (the third approach in the paper) [] and []. However, to our knowledge this work does not address the question of when to stop adding data. We now discuss the use of Chebychev's bound to determine if more data is needed for learning.
Others [] have used similar bounds (Chernoff's bound) to produce lower bound results for other pattern recognition problems. Since both the expectation and variance of the parameters of interest are known, we can make use of the Chebychev inequality. The Chebychev inequality allows the definition of the sample size, s, required to obtain a parameter estimate p̂ within an error ε of the true value from a distribution with a mean of p and standard deviation of σ, as shown below.

$$P\big[\,|\hat{p} - p| > \varepsilon\,\big] < \frac{\sigma^2}{s\,\varepsilon^2} \qquad (5)$$

This expression can be interpreted as an upper bound on the chance that the error is larger than ε. In turn, we can upper bound the right-hand side of this expression by δ, which can be considered the maximum chance/risk we are willing to take that our estimate and the true value differ by more than ε. Rearranging these terms to solve for the sample size yields:

$$s \ge \frac{\sigma^2}{\varepsilon^2\,\delta} \qquad (6)$$

Typical applications of the Chebychev inequality use σ² = p(1 − p) and produce a bound that requires knowing the parameter that is being estimated! []. However, we can use our empirically (through bootstrapping) obtained variance and circumvent this problem. The question of how much data is needed is now answered with respect to how much error (ε) in our estimates we are willing to tolerate and the chance (δ) that this error will be exceeded.

We can use equation (6) to determine if more data is needed if we are to tolerate learning parameters that, with a chance δ, differ from the true parameters by more than ε. The constants δ and ε are set by the user and effectively define the user's definition of how close is close enough. For each node in the belief network we compute the number of instances of each type required. We then compare this against the number of actual instances of this type in the data set. If for any node the required number is greater than the actual number then more data needs to be added. By instance type, we mean instances with a specific combination of parent node values. There are as many instance types as there are needed marginal or conditional probability table entries. For example, for the Cancer network (Figure 1) the Exposure node generates two types, one each for when the value of its parent (Age) is T or F; the values of the remaining nodes are not specified.
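The stopping test built on equation (6) can be sketched as below. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, `variance` is assumed to be the per-record variance σ² that appears in the Chebychev bound (a bootstrap variance of an estimate learnt from n records would, under that assumption, first be multiplied by n), and more data is taken to be needed when the required count exceeds the count actually available.

```python
import math


def required_sample_size(variance, epsilon, delta):
    """Smallest integer s with s >= variance / (epsilon^2 * delta),
    i.e. the sample size at which the Chebychev bound on
    P[|p_hat - p| > epsilon] drops to delta."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta <= 1:
        raise ValueError("delta must lie in (0, 1]")
    return math.ceil(variance / (epsilon ** 2 * delta))


def needs_more_data(variance, epsilon, delta, actual_count):
    """True when the instances of this type in the data set fall short
    of the number required by the bound."""
    return required_sample_size(variance, epsilon, delta) > actual_count


# Hypothetical values chosen so the arithmetic is exact in binary:
# 0.25 / (0.125^2 * 0.25) = 64 instances of this type are required.
required = required_sample_size(variance=0.25, epsilon=0.125, delta=0.25)
```

In a network the check would be repeated for every instance type (every marginal or conditional probability table entry), and data collection would continue while any of them reports that more data is needed.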

To test this stopping criterion we need to empirically determine when adding new instances to the training set adds no information. This involves measuring the expectation (over the joint distribution) of the code word lengths for each combination of node values. We wish to add instances to the training set only if they reduce the mean amount of information needed to specify a combination of node values. As the training data size increases, the learnt parameters stabilize and better approximate the chance of occurrence in the population. Formally, the expected information is:

$$\text{Average \# bits} = -\sum_{i=1}^{2^m} H^{*}(E_i)\,\log\big(\theta(E_i)\big) \qquad (7)$$

where E_i is a particular combination of node values, of which there are 2^m for an m binary node network. H*(.) is the probability distribution of the generating mechanism and θ(.) is the learnt distribution. Figure 4 shows the expected information for different sized training sets for the Asia and Cancer networks averaged over experiments. In each of these experiments the training data is incrementally added by sampling without replacement from the population (, instances generated from a Gibbs sampler). Calculating this expected information requires a significant amount of time and cannot be used for determining when to stop adding instances to the training set.

[Figure 4 plot titles: Mean Information Versus Size of Training Set; Cancer Graph: Comparison of *Optimal* Active and Random Selection. Axes: Mean Information Level, Iteration. Series: Active Selection, Random Selection.]

Figure 4. The mean information for the joint distribution for the Asia (top curve) and Cancer networks against training set size (x). At approximately instances and instances the decrease in information due to adding instances is negligible for the Asia and Cancer data sets respectively.

For one hundred experiments we used the test described earlier in italics with ε = . and δ = .. Whether these results generalize to more complicated networks remains as future work.

Columns: Stopping Point (From Figure 4), Mean Over All Trials, Standard Deviation Over All Trials
Asia 9
Cancer 67 9

Table 2. For trials, the mean number of instances in the training set before the test based on equation (6) specifies adding no more data points. The correct number of instances is visually determined from Figure 4.

Confidence Intervals Over Inferences

Typically with belief networks an inference involves calculating a point estimate of P(N_i = T | {E}), where E is some set of node values that can be considered as evidence. For small networks exact inference is possible, but since exact inference is NP-hard, for larger networks approximate inference techniques such as Gibbs sampling are used []. However, if the variance of each parameter is known we can now create a confidence interval associated with P(N_i = T | {E}). We now describe a simple approach to achieve this. For a belief network parameter of value p and standard deviation σ, we know with 95% confidence that its true value is in the range p ± 1.96σ. We can now create three versions of the belief network: one where all network parameters are set to p (the expected case), another where the network parameters are set to p − 1.96σ (lower bound) and another where they are set to p + 1.96σ (upper bound). Point estimation is performed in each of the three networks using whatever inference technique is applicable, and the three values are then brought together to return an expected, lower and upper bound on the inference to provide a 95% confidence interval. Though it may be thought that the changes amongst these networks will be trivial and not return greatly varying inferences, recent work [14] has shown that even small changes in network parameters can result in large changes in the inference/query probabilities.

Future Work and Conclusion

Our work posed the question: given many independently drawn random samples from the population, if we learn the belief network parameter estimates from each, how will the learnt parameters differ from sample to sample? We have shown that, for complete data and known graph structure, the learnt parameters will be asymptotically distributed around the generating mechanism (true) parameters. This result follows from the central limit theorem and has many potential applications in belief network learning. This means each node in the belief network is described by the parameter's mean and variance. The variance parameter measures the uncertainty in the parameter estimates due to variability in the data. However, having many random samples from which to estimate the variance is a great luxury. By using bootstrap sampling we can create an estimate of the variance by sampling from a single data set with replacement.

Having a probability distribution over the learnt parameters due to observed data uncertainty has many potential applications. We describe two. We show how to create upper and lower bounds on the probabilities of inferences by creating three networks that represent the expected, lower bound and upper bound. Using the mean and variance associated with each parameter allows determining if more data is required so as to guarantee with a certain probability that the difference between the expected and actual parameter means is within a predetermined error.
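The three-network procedure for confidence intervals over inferences can be sketched on the smallest possible example, a two-node network Parent → Child with the query P(Parent = T | Child = T) answered by exact enumeration. Everything named here is hypothetical: the function names, the parameter values and the clipping of shifted parameters into (0, 1) are assumptions of this sketch rather than details given in the paper.

```python
def clip_probability(x, margin=1e-9):
    """Keep a shifted parameter strictly inside (0, 1)."""
    return min(1.0 - margin, max(margin, x))


def posterior_parent_given_child(p_parent, p_child_if_t, p_child_if_f):
    """Exact P(Parent = T | Child = T) in the network Parent -> Child."""
    joint_t = p_parent * p_child_if_t
    joint_f = (1.0 - p_parent) * p_child_if_f
    return joint_t / (joint_t + joint_f)


def three_network_inference(means, sigmas, z=1.96):
    """Run the same query in three networks whose parameters are all
    set to mean - z*sigma, mean, and mean + z*sigma respectively."""
    results = {}
    for label, shift in (("all_low", -z), ("expected", 0.0), ("all_high", z)):
        params = [clip_probability(m + shift * s) for m, s in zip(means, sigmas)]
        results[label] = posterior_parent_given_child(*params)
    return results


# Hypothetical parameter means and bootstrap standard deviations for
# P(Parent=T), P(Child=T | Parent=T), P(Child=T | Parent=F).
inference = three_network_inference(
    means=(0.3, 0.8, 0.2), sigmas=(0.02, 0.03, 0.03)
)
```

One observation from this toy case: because the query is not monotone in every parameter, the values from the shifted networks need not bracket the expected value, so a caller may prefer to report the minimum and maximum of the three results as the interval.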
We have limited our results in this paper to the simplest belief network learning situation to ensure that the uncertainty in the parameters is only due to data variability. We intend next to extend our results to the situation where EM and SEM are used to learn from incomplete data and unknown graph structure.

References

[1] F. Jensen, An Introduction to Bayesian Networks, Springer Verlag, 1996.
[2] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.
[3] Friedman, N. (1998), The Bayesian structural EM algorithm, in G. F. Cooper & S. Moral, eds., Proc. Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI '98).
[4] Devore, Peck, Statistics: The Exploration and Analysis of Data, 3rd Ed, Wadsworth Publishing.
[5] Efron, B. 1979. Bootstrap methods: another look at the jackknife. Annals of Statistics 7: 1-26.
[6] Mitchell, T., Machine Learning, McGraw Hill, 1997.
[7] Bottou, L., and Bengio, Y., Convergence properties of the k-means algorithm. In G. Tesauro and D. Touretzky, editors, Adv. in Neural Info. Proc. Systems, volume 7, pages 585-592. MIT Press, Cambridge MA, 1995.
[8] Ruslan Salakhutdinov, Sam T. Roweis & Zoubin Ghahramani, Optimization with EM and Expectation-Conjugate-Gradient, International Conference on Machine Learning (ICML).
[9] D. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.
[10] Simon Tong, Daphne Koller, Active Learning for Parameter Estimation in Bayesian Networks. NIPS.
[11] Cohn, D.; Atlas, L.; and Ladner, R. 1994, Improving Generalization with Active Learning. Machine Learning 15(2):201-221.
[12] H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. In AAAI Wkshp. Knowledge Discovery in Databases, July 1994.
[13] M. Pradhan and P. Dagum. Optimal Monte Carlo estimation of belief network inference. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, Portland, Oregon, 1996.
[14] Hei Chan and Adnan Darwiche. When do Numbers Really Matter? JAIR 17 (2002) 265-287.
