Multi-Agent Decision Tree Learning from Distributed Autonomous Data Sources. D. Caragea, A. Silvescu, and V. Honavar


Iowa State University, Computer Science Department, Artificial Intelligence Research Group, Ames, IA

INTRODUCTION

Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data acquisition technologies, have made it possible to gather and store large volumes of data in digital form. For example, advances in high-throughput sequencing and other data acquisition technologies have resulted in gigabytes of DNA, protein sequence data, and gene expression data being gathered at steadily increasing rates in the biological sciences. Organizations have begun to capture and store a variety of data about various aspects of their operations (e.g., products, customers, and transactions). Complex distributed systems (e.g., computer systems, communication networks, power systems) are equipped with sensors and measurement devices that gather and store a variety of data for use in monitoring, controlling, and improving the operation of such systems.

These developments have resulted in unprecedented opportunities for large-scale data-driven knowledge acquisition with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed and are often autonomous. Given the large size of these data sets, gathering all of the data in a centralized location is generally neither desirable nor feasible because of bandwidth and storage requirements. In such domains, there is a need for knowledge acquisition systems that can perform the necessary analysis of data at locations where the data and the computational resources are available and transmit the results of analysis (knowledge acquired from the data) to locations where they are needed. In other domains, the ability of autonomous organizations to share raw data may be limited due to a variety of reasons (e.g., privacy considerations). In such cases, there is a need for knowledge acquisition systems that can learn from statistical summaries of data (e.g., counts of instances that match certain criteria) that are made available as needed from the distributed data sources in the absence of access to raw data.

Thus distributed learning systems have to integrate many heterogeneous, relatively autonomous components (e.g., data repositories, servers, user-supplied data transformations and learning algorithms) and provide support for data analysis where the data and computational resources are available. Agent-oriented software engineering (Jennings and Wooldridge, 2001; Honavar et al., 1998) offers an attractive approach to implementing modular and extensible distributed computing systems. For the purpose of this discussion, an intelligent agent is an encapsulated information processing system that is situated in some environment and is capable of flexible, autonomous action within the constraints of the environment so as to achieve its design objectives.

The distributed learning scenarios described above call for the design of distributed intelligent learning agents with provable convergence properties with respect to the batch scenario (i.e., when all of the data is available in a central location).

Against this background, this chapter presents an approach to the design of distributed intelligent learning agents. We precisely formulate a class of distributed learning problems, discuss the common types of data fragmentation that result from the distributed nature of the data (vertical fragmentation and horizontal fragmentation), and present a general strategy for transforming a large class of traditional batch learning algorithms into distributed learning algorithms. Then we demonstrate an application of this strategy to devise intelligent learning agents for decision tree induction (using a variety of commonly used splitting criteria) from horizontally or vertically fragmented distributed data, and show that the algorithms underlying the devised agents are provably exact in that the decision tree constructed from distributed data is identical to that obtained by the corresponding algorithm in the batch setting. We also provide an analysis of the time and communication complexity of the algorithms underlying the proposed agents. The distributed decision tree induction agents described in this chapter have been implemented as part of INDUS, an agent-based system for data-driven knowledge acquisition from heterogeneous, distributed, and autonomous data sources. INDUS is being applied to several knowledge discovery problems in molecular biology and network-based intrusion detection. We conclude this chapter with a discussion of related research and a brief outline of future research directions.

DISTRIBUTED LEARNING

The problem of learning from distributed data sets can be summarized as follows: the data sets $D_1, D_2, \ldots, D_n$ are distributed across multiple autonomous sites $1, 2, \ldots, n$, and the learner's task is to acquire useful knowledge $h$ from this data. For instance, such knowledge might take the form of a decision tree or a set of rules for pattern classification. In such a setting, learning can be accomplished by an agent $A$ that visits the different sites to gather the information needed to generate a suitable model (e.g., a decision tree) from the data (serial distributed learning, Figure 1).

[Figure 1: Serial distributed learning (agent $A$, hypothesis $h$, information $I_1, I_2, \ldots, I_n$, data sets $D_1, D_2, \ldots, D_n$).]

Alternatively, the different sites can transmit the information necessary for constructing the decision tree to the learning agent $A$ situated at a central location (parallel distributed learning, Figure 2).

[Figure 2: Parallel distributed learning (agent $A$, hypothesis $h$, information $I_1, I_2, \ldots, I_n$, data sets $D_1, D_2, \ldots, D_n$).]

We assume that it is not feasible to transmit raw data between sites. Consequently, the learner has to rely on information $I_1, I_2, \ldots, I_n$ (e.g., statistical summaries such as counts of data tuples that satisfy particular criteria) extracted from the sites. Our approach to learning from distributed data sets involves identifying the information requirements of existing learning algorithms, and designing efficient means of providing the necessary information to the learner while avoiding the need to transmit large quantities of data (Caragea et al., 2000).

Exact Distributed Learning

We say that a distributed learning algorithm $L_d$ (e.g., for decision tree induction from distributed data sets), embedded into an agent $A$, is exact with respect to the hypothesis inferred by a batch learning algorithm $L$ (e.g., for decision tree induction from a centralized data set) if the hypothesis produced by $L_d$ using distributed data sets $D_1, D_2, \ldots, D_n$, stored at sites $1, 2, \ldots, n$ (respectively), is the same as that obtained by $L$ from the complete data set $D$ obtained by appropriately combining the data sets $D_1, D_2, \ldots, D_n$. Similarly, we can define exact distributed learning with respect to other criteria of interest (e.g., expected accuracy of the learned hypothesis). More generally, it might be useful to consider approximate distributed learning in similar settings. However, the discussion that follows is focused on exact distributed learning.

Horizontal and Vertical Data Fragmentation

In many applications, the data set consists of a set of tuples where each tuple stores the values of relevant attributes. The distributed nature of such a data set can lead to at least two common types of data fragmentation: horizontal fragmentation, wherein subsets of data tuples are stored at different sites; and vertical fragmentation, wherein subtuples of data tuples are stored at different sites. Assume that a data set $D$ is distributed among the sites $1, 2, \ldots, n$ containing data set fragments $D_1, D_2, \ldots, D_n$. We assume that the individual data sets $D_1, D_2, \ldots, D_n$ collectively contain enough information to generate the complete data set $D$. In many applications, it might be the case that the individual data sets are autonomously owned and maintained. Consequently, access to the raw data may be limited and only summaries of the data (e.g., the number of instances that match some criteria of interest) may be made available to the learner. Even in cases where access to raw data is not limited, the large size of the data sets makes it infeasible to assemble the complete data set $D$ at a central location.

Horizontal Fragmentation

In the case of horizontal fragmentation, the data is distributed in such a manner that each site contains a set of data tuples. The union of all these sets constitutes the complete data set. If the individual data sets (horizontal fragments) are denoted by $D_1, D_2, \ldots, D_n$, and the corresponding complete data set by $D$, then Horizontally Distributed Data (HDD) has the following property:

$D_1 \cup D_2 \cup \cdots \cup D_n = D$,

where $\cup$ denotes set union. Hence, in this case, a distributed learning algorithm $L_d$ is exact with respect to the hypothesis inferred by a learning algorithm $L$ if it is the case that:

$L_d(D_1, D_2, \ldots, D_n) = L(D_1 \cup D_2 \cup \cdots \cup D_n)$.

The challenge is to achieve this guarantee without providing $L_d$ with simultaneous access to $D_1, D_2, \ldots, D_n$.

Vertical Fragmentation

In this case, each data tuple is fragmented into several subtuples, each of which shares a unique key or index. Thus, different sites store vertical fragments of the data set. Each vertical fragment corresponds to a subset of the attributes that describe the complete data set. It is possible for some attributes to be shared (duplicated) across more than one vertical fragment, leading to overlap between the corresponding fragments. Let $A_1, A_2, \ldots, A_n$ indicate the sets of attributes whose values are stored at sites $1, 2, \ldots, n$, respectively, and let $A$ denote the set of attributes that are used to describe the data tuples of the complete data set. Then in the case of Vertically Distributed Data (VDD), we have:

$A_1 \cup A_2 \cup \cdots \cup A_n = A$.

Let $D_1, D_2, \ldots, D_n$ denote the fragments of the data set stored at sites $1, 2, \ldots, n$, respectively, and let $D$ denote the complete data set. Let the $i$th tuple in a data fragment $D_j$ be denoted as $t^i_{D_j}$. Let $t^i_{D_j}.\mathrm{index}$ denote the unique index associated with tuple $t^i_{D_j}$, and let $\bowtie$ denote the join operation. Then the following properties hold for VDD:

$D_1 \bowtie D_2 \bowtie \cdots \bowtie D_n = D$, and $\forall D_j, D_k: \; t^i_{D_j}.\mathrm{index} = t^i_{D_k}.\mathrm{index}$.

Thus, the subtuples from the vertical data fragments stored at different sites can be put together using their unique index to form the corresponding data tuples of the complete data set. It is possible to envision scenarios in which a vertically fragmented data set might lack unique indices. In such a case, it might be necessary to use combinations of attribute values to infer associations among tuples (Bhatnagar and Srinivasan, 1997). In what follows, we will assume the existence of unique indices in vertically fragmented distributed data sets. In the case of vertically fragmented data, a distributed learning algorithm $L_d$ is exact with respect to the hypothesis inferred by a learning algorithm $L$ if it is the case that:

$L_d(D_1, D_2, \ldots, D_n) = L(D_1 \bowtie D_2 \bowtie \cdots \bowtie D_n)$.

The challenge is to guarantee this without providing $L_d$ with simultaneous access to $D_1, D_2, \ldots, D_n$.

Transforming Batch Learning Algorithms into Exact Distributed Learning Algorithms

Our general strategy for transforming a batch learning algorithm (e.g., a traditional decision tree induction algorithm) into an exact distributed learning algorithm involves identifying the information requirements of the algorithm and designing efficient means for providing the needed information to the learning agent while avoiding the need to transmit large amounts of data. Thus, we decompose the distributed learning task into distributed information extraction and hypothesis generation components. The feasibility of this approach depends on the information requirements of the batch algorithm $L$ under consideration and the (time, memory, and communication) complexity of the corresponding distributed information extraction operations. In this approach to distributed learning, only the information extraction component has to effectively cope with the distributed nature of data in order to guarantee provably exact learning in the distributed setting in the sense discussed above.

Suppose we decompose a batch learning algorithm $L$ in terms of an information extraction operator $I$ that extracts the necessary information from a data set and a hypothesis generation operator $H$ that uses the extracted information to produce the output of the learning algorithm $L$. That is, $L(D) = H(I(D))$. Suppose we define a distributed information extraction operator $I_d$ that generates from each data set $D_i$ the corresponding information $I_i = I_d(D_i)$, and an operator $C$ that combines this information to produce $I(D)$. That is, the information extracted from the distributed data sets is the same as that used by $L$ to infer a hypothesis from the complete data set $D$:

$C(I_d(D_1), I_d(D_2), \ldots, I_d(D_n)) = I(D)$.

Thus, we can guarantee that $L_d(D_1, D_2, \ldots, D_n) = H(C(I_d(D_1), I_d(D_2), \ldots, I_d(D_n)))$ will be exact with respect to $L(D) = H(I(D))$.

AGENT BASED DECISION TREE INDUCTION FROM DISTRIBUTED DATA

Decision tree algorithms (Quinlan, 1986; Breiman et al., 1984; Buja and Lee, 2001) represent a widely used family of machine learning algorithms for building pattern classifiers from labeled training data. They can also be used to learn associations among different attributes of the data. Some of their advantages over other machine learning techniques include their ability to: select, from all attributes used to describe the data, a subset of attributes that are relevant for classification; identify complex predictive relations among attributes; and produce classifiers that can be translated in a straightforward manner into rules that are easily understood by humans.

A variety of decision tree algorithms have been proposed in the literature. However, most of them recursively select, in a greedy fashion, the attribute that is used to partition the data set under consideration into subsets until each leaf node in the tree has uniform class membership. The ID3 (Iterative Dichotomizer 3) algorithm proposed by Quinlan (Quinlan, 1986) and its more recent variants represent a widely used family of decision tree learning algorithms. The ID3 algorithm searches, in a greedy fashion, for attributes that yield the maximum amount of information for determining the class membership of instances in a training set $S$ of labeled instances. The result is a decision tree that correctly assigns each instance in $S$ to its respective class. The construction of the decision tree is accomplished by recursively partitioning $S$ into subsets based on values of the chosen attribute until each resulting subset has instances that belong to exactly one of the $M$ classes. The selection of an attribute at each stage of construction of the decision tree maximizes the estimated expected information gained from knowing the value of the attribute in question.

Different algorithms for decision tree induction differ from each other in terms of the criterion that is used to evaluate the splits that correspond to tests on different candidate attributes. The choice of the attribute at each node of the decision tree greedily maximizes (or minimizes) the chosen splitting criterion. Often, decision tree algorithms also include a pruning phase to alleviate the problem of overfitting the training data. For the sake of simplicity of exposition, we limit our discussion to decision tree construction without pruning. However, it is relatively straightforward to modify the proposed algorithms to incorporate a variety of pruning methods.
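The greedy, recursive partitioning described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ID3 itself: the names `build_tree`, `best_attribute`, `classify`, and the toy `ROWS` data are hypothetical, attributes are categorical, there is no pruning, and the impurity function is pluggable (misclassification rate is used here only as a stand-in for whichever splitting criterion is chosen).

```python
from collections import Counter

def misclassification(labels):
    """Impurity: fraction of labels that differ from the majority label."""
    if not labels:
        return 0.0
    return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

def best_attribute(rows, attributes, target, impurity):
    """Pick the attribute whose split has the lowest weighted impurity.

    Minimizing the weighted impurity of the subsets is equivalent to
    maximizing the gain, because the impurity of the parent set is the
    same for every candidate attribute.
    """
    def weighted(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * impurity(g) for g in groups.values())
    return min(attributes, key=weighted)

def build_tree(rows, attributes, target, impurity=misclassification):
    """Return a class label (leaf) or a tuple (attribute, {value: subtree})."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target, impurity)
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [row for row in rows if row[attr] == value]
        children[value] = build_tree(subset, remaining, target, impurity)
    return (attr, children)

def classify(tree, row):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[row[attr]]
    return tree

# Hypothetical toy training set, used only for illustration.
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
TREE = build_tree(ROWS, ["outlook", "windy"], "play")
```

On this toy set, `classify(TREE, row)` reproduces the label of every training row, which mirrors the statement that the unpruned tree assigns each instance in $S$ to its class. The sketch does not handle attribute values that were never seen during training.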

Splitting Criteria

Some of the popular splitting criteria are based on entropy (Quinlan, 1986), which is used by Quinlan's ID3 algorithm and its variants, and the Gini index (Breiman et al., 1984), which is used by Breiman's CART algorithm, among others. More recently, additional splitting criteria that are useful for exploratory data analysis have been proposed (Buja and Lee, 2001).

Consider a set of instances $S$ that is partitioned into $M$ disjoint subsets (classes) $C_1, C_2, \ldots, C_M$ such that $S = \bigcup_{i=1}^{M} C_i$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. The estimated probability that a randomly chosen instance $s \in S$ belongs to the class $C_j$ is $p_j = |C_j| / |S|$, where $|X|$ denotes the cardinality of the set $X$. The estimated entropy of a set $S$ measures the expected information needed to identify the class membership of instances in $S$, and is defined as follows:

$\mathrm{entropy}(S) = -\sum_{j} \frac{|C_j|}{|S|} \log_2 \frac{|C_j|}{|S|}$.

The estimated Gini index for the set $S$ containing examples from $M$ classes is defined as follows:

$\mathrm{gini}(S) = 1 - \sum_{j} \left( \frac{|C_j|}{|S|} \right)^2$.

Given some impurity measure (either the entropy or Gini index, or any other measure that can be defined based on the probabilities $p_j$), we can define the estimated information gain for an attribute $a$, relative to a collection of instances $S$, as follows:

$\mathrm{IGain}(S, a) = I(S) - \sum_{v \in \mathrm{Values}(a)} \frac{|S_v|}{|S|} I(S_v)$,

where $\mathrm{Values}(a)$ is the set of all possible values for attribute $a$, $S_v$ is the subset of $S$ for which attribute $a$ has value $v$, and $I(S)$ can be $\mathrm{entropy}(S)$, $\mathrm{gini}(S)$, or any other suitable measure.
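The following Python sketch shows how the entropy, the Gini index, and the information gain defined above can be computed from class counts alone. The function names and the layout of `counts_by_value` (a mapping from each attribute value $v$ to the list of per-class counts in $S_v$) are assumptions made for illustration, not part of the original chapter.

```python
import math

def entropy(class_counts):
    """Estimated entropy from per-class counts |C_j|; empty classes contribute 0."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

def gini(class_counts):
    """Estimated Gini index from per-class counts |C_j|."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def information_gain(counts_by_value, impurity=entropy):
    """IGain(S, a) = I(S) - sum_v |S_v|/|S| * I(S_v), computed from counts only.

    counts_by_value maps each value v of attribute a to the per-class
    counts of S_v, with classes listed in the same order for every value.
    """
    tables = list(counts_by_value.values())
    num_classes = len(tables[0])
    # Class counts of S are the column sums over all subsets S_v.
    totals = [sum(t[j] for t in tables) for j in range(num_classes)]
    size = sum(totals)
    remainder = sum(sum(t) / size * impurity(t) for t in tables if sum(t) > 0)
    return impurity(totals) - remainder

# Hypothetical counts: classes ordered as [yes, no].
EXAMPLE = {"sunny": [1, 1], "rain": [0, 2]}
GAIN_ENTROPY = information_gain(EXAMPLE, entropy)
GAIN_GINI = information_gain(EXAMPLE, gini)
```

Because every quantity is a function of counts, the same functions apply unchanged whether the counts come from one centralized data set or are assembled from several sites.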

It follows that the information requirements of decision tree learning algorithms are the same for both these splitting criteria; in both cases, we need the relative frequencies computed from the relevant instances. In fact, additional splitting criteria that correspond to other impurity measures can be used instead, provided that these measures can be computed based on the statistics that can be obtained from the data sets. Examples of such splitting criteria include misclassification rate, one-sided purity, and one-sided extremes (Buja and Lee, 2001). This turns out to be quite useful in practice since different criteria often provide different insights about data. Furthermore, as we show below, the information necessary for decision tree construction can be efficiently obtained from distributed data sets. This results in provably exact algorithms for decision tree induction from horizontally or vertically fragmented distributed data sets.

Distributed Information Extraction

Assume that, given a partially constructed decision tree, we want to choose the next best attribute for splitting. Let $a_j(\pi)$ denote the attribute at the $j$th node along a path $\pi$ starting from the attribute $a_1(\pi)$ that corresponds to the root of the decision tree, leading up to the node in question $a_l(\pi)$ at depth $l$. Let $v(a_j(\pi))$ denote the value of the attribute $a_j(\pi)$, corresponding to the $j$th node along the path $\pi$. For adding a node below $a_l(\pi)$, the set of examples being considered satisfies the following constraints on values of attributes:

$L(\pi) = [a_1(\pi) = v(a_1(\pi))] \wedge [a_2(\pi) = v(a_2(\pi))] \wedge \cdots \wedge [a_l(\pi) = v(a_l(\pi))]$,

where $[a_j(\pi) = v(a_j(\pi))]$ denotes the fact that the value of the $j$th attribute along the path $\pi$ is $v(a_j(\pi))$.

It follows from the preceding discussion that the information required for constructing decision trees consists of the counts of examples that satisfy specified constraints on the values of particular attributes. These counts have to be obtained once for each node that is added to the tree, starting with the root node. If we can devise distributed information extraction operators for obtaining the necessary counts from distributed data sets, we can obtain exact distributed decision tree learning algorithms. Thus, the decision tree constructed from a given data set in the distributed setting is exactly the same as that obtained in the batch setting when using the same splitting criterion in both cases.

Horizontally Distributed Data

When the data is horizontally distributed, examples corresponding to a particular value of a particular attribute are scattered at different locations. In order to identify the best split of a particular node in a partially constructed tree, all the sites are visited and the counts corresponding to candidate splits of that node are accumulated. The learner uses these counts to find the attribute that yields the best split to further partition the set of examples at that node. Thus, given $L(\pi)$, in order to split the node corresponding to $a_l(\pi) = v(a_l(\pi))$, the information extraction component has to obtain the counts of examples that belong to each class for each possible value of each candidate attribute.

Let $|D|$ be the total number of examples in the distributed data set; $|A|$, the number of attributes; $V$, the maximum number of possible values per attribute; $n$, the number of sites; $M$, the number of classes; and $\mathrm{size}(T)$, the number of nodes in the decision tree. For each node in the decision tree $T$, the information extraction component has to scan the data at each site to calculate the corresponding counts. We have $\sum_{i=1}^{n} |D_i| = |D|$. Therefore, in the case of serial distributed learning, the time complexity of the resulting algorithm is $|D| \cdot |A| \cdot \mathrm{size}(T)$. This can be further improved in the case of parallel distributed learning since each site can perform information extraction in parallel.

For each node in the decision tree $T$, each site has to transmit the counts based on its local data. These counts form a matrix of size $M \cdot |A| \cdot V$. Hence, the communication complexity (the total amount of information that is transmitted between sites) is given by $M \cdot |A| \cdot V \cdot n \cdot \mathrm{size}(T)$. It is worth noting that some of the bounds presented here can be further improved so that they depend on the height of the tree instead of the number of nodes in the tree by taking advantage of the sort of techniques that are introduced in (Shafer et al., 1996; Gehrke et al., 1999).

Vertically Distributed Data

In vertically distributed data sets, we assume that each example has a unique index associated with it. Subtuples of an example are distributed across different sites. However, correspondence between subtuples of a tuple can be established using the unique index. As before, given $L(\pi)$, in order to split the node corresponding to $a_l(\pi) = v(a_l(\pi))$, the information extraction component has to obtain the counts of examples that belong to each class for each possible value of each candidate attribute. Since each site has only a subset of the attributes, the set of indices corresponding to the examples that match the constraint $L(\pi)$ has to be transmitted to the sites. Using this information, each site can compute the relevant counts that correspond to the attributes that are stored at the site. The hypothesis generation component uses the counts from all the sites to select the attribute to further split the node corresponding to $a_l(\pi) = v(a_l(\pi))$.

For each node in the decision tree $T$, each site has to compute the relevant counts of examples that satisfy $L(\pi)$ for the attributes stored at that site. The number of subtuples stored at each site is $|D|$ and the number of attributes at each site is bounded by the total number of attributes $|A|$. In the case of serial distributed learning, the time complexity is given by $|D| \cdot |A| \cdot n \cdot \mathrm{size}(T)$. This can be further improved in the case of parallel distributed learning since the various sites can perform information extraction in parallel.

For each node in the tree $T$, we need to transmit to each site the set of indices for the examples that satisfy the corresponding constraint $L(\pi)$ and get back the relevant counts for the attributes that are stored at that site. The number of indices is bounded by $|D|$ and the number of counts is bounded by $M \cdot |A| \cdot V$. Hence, the communication complexity is given by $(|D| + M \cdot |A| \cdot V) \cdot n \cdot \mathrm{size}(T)$. Again, it is possible to further improve some of these bounds so that they depend on the height of the tree instead of the number of nodes in the tree using techniques similar to those introduced in (Shafer et al., 1996; Gehrke et al., 1999).

Distributed versus Centralized Learning

Our approach to learning decision trees from distributed data, based on a decomposition of the learning task into a distributed information extraction component and a hypothesis generation component, provides an effective way to deal with scenarios in which the sites provide only statistical summaries of the data on demand and prohibit access to raw data. Even when it is possible to access the raw data, the distributed algorithm compares favorably with the corresponding centralized algorithm, which needs access to the entire data set, whenever its communication cost is less than the cost of collecting all of the data in a central location. It follows from the preceding analysis that in the case of horizontally fragmented data, the distributed algorithm has an advantage when $M \cdot V \cdot n \cdot \mathrm{size}(T) \leq |D|$, since the cost of shipping the data is given by its actual size, which is $|D| \cdot |A|$. In the case of vertically fragmented data, the corresponding condition is given by $n \cdot \mathrm{size}(T) \leq |A|$, since the cost of shipping the data is given by its actual size, which has a lower bound of $|D| \cdot |A|$. These conditions are often met in the case of large, high-dimensional data sets.
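The count-based information extraction described for horizontally and vertically fragmented data can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the INDUS implementation: the function names and toy data are hypothetical, each vertical fragment is assumed to be a dictionary keyed by the unique index, the class attribute (`TARGET`) is assumed to be replicated at every site, and every other attribute is assumed to be stored at exactly one site.

```python
from collections import Counter

TARGET = "play"  # hypothetical class attribute, assumed present at every site

def matches(row, constraint):
    """True if the row passes every [attribute = value] test it can evaluate."""
    return all(row[a] == v for a, v in constraint.items() if a in row)

def local_counts(rows, constraint, candidates):
    """Per-site extraction: (attribute, value, class) counts over matching rows."""
    counts = Counter()
    for row in rows:
        if matches(row, constraint):
            for a in candidates:
                if a in row:
                    counts[(a, row[a], row[TARGET])] += 1
    return counts

def combine(per_site):
    """Combination operator: add up the count tables received from the sites."""
    total = Counter()
    for counts in per_site:
        total.update(counts)
    return total

def horizontal_counts(fragments, constraint, candidates):
    """Each site holds whole tuples; counts are summed across sites."""
    return combine(local_counts(rows, constraint, candidates) for rows in fragments)

def vertical_counts(fragments, constraint, candidates):
    """Each site holds subtuples keyed by index; only indices and counts travel."""
    indices = None
    for table in fragments:
        ok = {i for i, sub in table.items() if matches(sub, constraint)}
        indices = ok if indices is None else indices & ok
    return combine(
        local_counts([table[i] for i in sorted(indices)], constraint, candidates)
        for table in fragments
    )

# Hypothetical toy data set and its two fragmentations.
DATA = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
HORIZONTAL = [DATA[:1], DATA[1:]]
VERTICAL = [
    {i: {"outlook": r["outlook"], "play": r["play"]} for i, r in enumerate(DATA)},
    {i: {"windy": r["windy"], "play": r["play"]} for i, r in enumerate(DATA)},
]
CONSTRAINT = {"outlook": "sunny"}   # the path constraint for the node being split
CANDIDATES = ["windy"]              # attributes not yet used along the path
CENTRAL = local_counts(DATA, CONSTRAINT, CANDIDATES)
```

In both fragmentations the combined table equals `CENTRAL`, the table computed from the complete data set, so any splitting criterion evaluated on these counts selects the same attribute as in the batch setting. If an attribute other than the class were duplicated across vertical fragments, the sketch would count it twice, so duplicates would have to be assigned to a single site first.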

SUMMARY AND DISCUSSION

Efficient learning algorithms with provable performance guarantees for knowledge acquisition from distributed data sets constitute a key element of any attempt to translate recent advances in our ability to gather and store large volumes of data into an ability to effectively use the data to advance our understanding of the respective domains (e.g., biological sciences, atmospheric sciences) and to develop decision support tools. In this paper, we have precisely formulated a class of distributed learning problems and presented a general strategy for transforming a class of traditional machine learning algorithms into distributed learning algorithms. We have demonstrated the application of this strategy to devise intelligent agents for decision tree induction (using a variety of splitting criteria) from distributed data. The resulting agents are based on algorithms that are provably exact in that the decision tree constructed from distributed data is identical to that obtained by the corresponding algorithm when it is used in the batch setting. This ensures that the entire body of theoretical (e.g., sample complexity, error bounds) and empirical results obtained in the batch setting carries over to the distributed setting. The proposed distributed decision tree induction agents have been implemented as part of INDUS, an agent-based system for data-driven knowledge acquisition from heterogeneous, distributed, and autonomous data sources.

In the proposed approach to learning from distributed data, the hypothesis generation component can be viewed as the control part of the learning process, which deploys the information extraction component as needed. The boundary that defines the division of labor between the distributed information extraction and hypothesis generation components depends on the hypothesis class used for learning, the batch learning algorithm, and the particular decomposition used. At one extreme, if no information extraction is performed, the hypothesis generation component needs to access the raw data. An example of this scenario is provided by distributed instance-based learning of k nearest neighbor classifiers from a horizontally fragmented data set. Here, the data set fragments are simply stored at the different sites. Classification of a new instance is performed by the hypothesis generation component, which computes the k nearest neighbors of the instance to be classified (based on some specified distance metric) by visiting the different sites. The classification assigned to the instance is the majority class among the k nearest neighbors of the instance. At the other extreme, if the information extraction component does most of the work, the task of the hypothesis generation component becomes trivial. For example, consider the Find-S algorithm for learning purely conjunctive concepts, which, starting with the conjunction of all literals, successively eliminates the literals that lead to misclassification of positive examples (Mitchell, 1997). A straightforward adaptation of this algorithm results in provably exact distributed learning of conjunctive concepts from horizontally fragmented data sets. In this context, it is interesting to explore the optimal division of labor between information extraction and hypothesis generation components for different learning problems under different conditions.

The distributed learning problem has begun to receive considerable attention in recent years. However, many of the algorithms proposed in the literature (Davies and Edwards, 1999; Domingos, 1997; Prodromidis et al., 2000) do not guarantee generalization accuracies that are provably close to those obtainable in the centralized setting. Typically, they deal with only horizontally fragmented data. Furthermore, several of them are motivated by the desire to scale up batch learning algorithms to work with large data sets by partitioning the data and parallelizing the algorithm.
In this case, the algorithm typically starts with the entire data set in a central location; the data set is then distributed across multiple processors to take advantage of parallel processing. In contrast, in the distributed scenario discussed in this paper, the algorithm may be prohibited from accessing the raw data; even when it is possible to access the raw data, it may be infeasible to gather all of the data at a central location (because of the bandwidth and storage costs involved).

An algorithm based on Fourier expansion of Boolean functions (Kargupta et al., 1999) deals with vertically distributed data sets. However, in its present form, it is computationally very expensive and offers no proven guarantees of performance relative to the batch algorithm. Furthermore, since a given set of coefficients can correspond to multiple decision trees, it does not yield a unique decision tree from a given data set. In contrast, the algorithms proposed in this paper guarantee provably exact learning from horizontally or vertically fragmented distributed data sets.

The algorithm proposed in (Bhatnagar and Srinivasan, 1997) is closely related to our algorithm for learning decision trees from vertically fragmented data using entropy or information gain as the splitting criterion. It provides a mechanism for obtaining counts from implicit tuples in the absence of a unique index for each tuple in the data set by simulating the effect of the join operation on the sites without enumerating the tuples. In contrast, our algorithms assume the existence of a unique index, but are more general in other respects (ability to deal with both horizontal and vertical fragmentation, incorporation of multiple splitting criteria). Our approach can be modified using an approach similar to that used in (Bhatnagar and Srinivasan, 1997) in the absence of unique indices.
Work in progress is aimed at the elucidation of the necessary and sufficient conditions that guarantee the existence of exact or approximate distributed learning algorithms in terms of the properties of data and hypothesis representations as well as information extraction and learning operators; characterization of information requirements for distributed learning under various assumptions; investigation of the optimum division of labor between the information gathering and hypothesis generation components of the algorithm under different assumptions; design of new classes of theoretically well-founded algorithms for distributed learning; integration of distributed learning algorithms with databases using new database operators for distributed information extraction (e.g., operators for obtaining counts proposed in (Graefe et al., 1998)); addressing the issues that arise in dealing with large databases when the processing has to be done under significant memory or processing constraints (Gehrke et al., 1999); integration of machine learning with visualization for exploratory data analysis; incorporation of domain-specific and possibly application-specific ontologies to bridge syntactic and semantic mismatches across distributed data sets; and application of the resulting techniques to large-scale data-driven knowledge discovery tasks in applications such as computational biology and intrusion detection.

REFERENCES

Bhatnagar, R., Srinivasan, S. (1997). Pattern discovery in distributed databases. In proceedings of AAAI 1997 conference, Providence, RI.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and regression trees. Wadsworth, Pacific Grove, CA.

Buja, A., Lee, Y.S. (2001). Data mining criteria for tree-based regression and classification. In proceedings of KDD 2001 conference, San Francisco, CA.

Caragea, D., Silvescu, A., Honavar, V. (2000). Toward a theoretical framework for analysis and synthesis of distributed and parallel knowledge discovery. In proceedings of KDD 2000 Workshop on DPKD, Boston, MA.

Caragea, D., Silvescu, A., Honavar, V. (2001). Analysis and synthesis of agents that learn from distributed dynamic data sources. In: Emergent Neural Computational Architectures Based on Neuroscience. Springer.

Davies, W., Edwards, P. (1999). Dagger: a new approach to combining multiple models learned from disjoint subsets. In proceedings of ICML 1999, Bled, Slovenia.

Domingos, P. (1997). Knowledge acquisition from examples via multiple models. In proceedings of ICML 1997, Nashville, TN.

Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.Y. (1999). BOAT: optimistic decision tree construction. In proceedings of SIGMOD 1999 conference, Philadelphia, Pennsylvania.

Graefe, G., Fayyad, U., Chaudhuri, S. (1998). On the efficient gathering of sufficient statistics for classification from large SQL databases. In proceedings of the KDD 1998 conference, Menlo Park, CA.

Honavar, V., Miller, L., Wong, J.S. (1998). Distributed knowledge networks. In proceedings of the IEEE conference on IT, Syracuse, NY.

Jennings, N., Wooldridge, M. (2001). Agent-oriented software engineering. In Bradshaw, J. (Ed.), Handbook of agent technology. AAAI/MIT Press.

Kargupta, H., Park, B.H., Hershberger, D., Johnson, E. (1999). Collective data mining: a new perspective toward distributed data mining. In Kargupta, H., Chan, P. (Eds.), Advances in distributed and parallel knowledge discovery. MIT/AAAI Press.

Mitchell, T.M. (1997). Machine learning. McGraw Hill.

Prodromidis, A.L., Chan, P., Stolfo, S.J. (2000). Meta-learning in distributed data mining systems: issues and approaches. In Kargupta, H., Chan, P. (Eds.), Advances in distributed and parallel knowledge discovery. MIT/AAAI Press.

Quinlan, R. (1986). Induction of decision trees. Machine Learning, 1.

Shafer, J.C., Agrawal, R., Mehta, M. (1996). SPRINT: a scalable parallel classifier for data mining. In proceedings of the 22nd international conference on VLDB, Mumbai (Bombay), India. Morgan Kaufmann.
