Additive Groves of Regression Trees


Daria Sorokina, Rich Caruana, and Mirek Riedewald
Department of Computer Science, Cornell University, Ithaca, NY, USA

Abstract. We present a new regression algorithm called Groves of trees and show empirically that it is superior in performance to a number of other established regression methods. A Grove is an additive model usually containing a small number of large trees. Trees added to the Grove are trained on the residual error of the other trees already in the Grove. We begin the training process with a single small tree in the Grove and gradually increase both the number of trees in the Grove and their size. This procedure ensures that the resulting model captures the additive structure of the response. A single Grove may still overfit to the training set, so we further decrease the variance of the final predictions with bagging. We show that in addition to exhibiting superior performance on a suite of regression test problems, bagged Groves of trees are very resistant to overfitting.

1 Introduction

We present a new regression algorithm called Grove, an ensemble of additive regression trees. We initialize a Grove with a single small tree. The Grove is then gradually expanded: on every iteration either a new tree is added, or the trees that are already in the Grove are made larger. This process is designed to try to find the simplest model (a Grove with the fewest number of small trees) that captures the underlying additive structure of the target function. As training progresses, this algorithm yields a sequence of Groves of slowly increasing complexity. Eventually, the largest Groves may begin to overfit the training set even as they continue to learn important additive structure. This overfitting is reduced by applying bagging on top of the Grove learning process.

In Section 2 we describe the Grove algorithm step by step, beginning with the classical way of training additive models and incrementally making this process more complicated and better performing at each step. In Section 3 we compare bagged Groves with two other regression ensembles: bagged regression trees and stochastic gradient boosting. The results show that bagged Groves outperform these other methods and work especially well on highly non-linear data sets. In Section 4 we show that bagged Groves are resistant to overfitting. We conclude and discuss future work in Section 5.

J.N. Kok et al. (Eds.): ECML 2007, LNAI 4701, pp. 323-334, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 Algorithm

Bagged Groves of Trees, or bagged Groves for short, is an ensemble of regression trees. Specifically, it is a bagged additive model of regression trees, where each individual additive model is trained in an adaptive way by gradually increasing both the number of trees and their complexity.

Regression Trees. The unit model in a Grove is a regression tree. Algorithms for training regression trees differ in two major aspects: (1) the criterion for choosing the best split in a node and (2) the way in which tree complexity is controlled. We use trees that optimize RMSE (root mean squared error), and we control tree complexity (size) by imposing a limit on the size (number of cases) at an internal node. If the fraction of the data points that reach a node is less than a specified threshold α, then the node is declared a leaf and is not split further. Hence the smaller α, 0 ≤ α ≤ 1, the larger the tree. (See Fig. 7.) Note that because we will later bag the tree models, the specific choice of regression tree is not particularly important. The main requirement is that the complexity of the tree should be controllable.

2.1 Additive Models: Classical Algorithm

A Grove of trees is an additive model where each additive term is represented by a regression tree. The prediction of a Grove is computed as the sum of the predictions of these trees:

    F(x) = T_1(x) + T_2(x) + ... + T_N(x).

Here each T_i(x), 1 ≤ i ≤ N, is the prediction made by the i-th tree in the Grove. The Grove model has two main parameters: N, the number of trees in the Grove, and α, which controls the size of each individual tree. We use the same value of α for all trees in a Grove.

In statistics, the basic mechanism for training an additive model with a fixed number of components is the backfitting algorithm [1]. We will refer to this as the Classical algorithm for training a Grove of regression trees (Algorithm 1). The algorithm cycles through the trees until they converge. The first tree in the Grove is trained on the original data set, a set of training points {(x, y)}. Let T̂_1 denote the function encoded by this tree. Then we train the second tree, which encodes T̂_2, on the residuals, i.e., on the set {(x, y − T̂_1(x))}. The third tree then is trained on the residuals of the first two, i.e., on {(x, y − T̂_1(x) − T̂_2(x))}, and so on. After we have trained N trees this way, we discard the first tree and retrain it on the residuals of the other N − 1 trees, i.e., on the set {(x, y − T̂_2(x) − T̂_3(x) − ... − T̂_N(x))}. Then we similarly discard and retrain the second tree, and so on. We keep cycling through the trees in this way until there is no significant improvement in the RMSE on the training set.

Algorithm 1. Classical additive model training

    function Classical(α, N, {x,y})
        for i = 1 to N do
            Tree_i^(α,N) = 0
        Converge(α, N, {x,y}, Tree_1^(α,N), ..., Tree_N^(α,N))

    function Converge(α, N, {x,y}, Tree_1^(α,N), ..., Tree_N^(α,N))
        repeat
            for i = 1 to N do
                newTrainSet = {x, y − Σ_{k≠i} Tree_k^(α,N)(x)}
                Tree_i^(α,N) = TrainTree(α, newTrainSet)
        until (change from the last iteration is small)
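To make Algorithm 1 concrete, the following is a minimal Python sketch of backfitting with scikit-learn regression trees. It is an illustration, not the authors' implementation; in particular, using min_samples_split as a stand-in for the α threshold is our assumption.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def converge(trees, X, y, alpha, max_cycles=20, tol=1e-4):
        # Backfit: retrain each tree on the residuals of all the others,
        # cycling until training RMSE stops improving noticeably.
        n = len(y)
        min_split = max(2, int(np.ceil(alpha * n)))  # stand-in for alpha
        preds = np.array([np.zeros(n) if t is None else t.predict(X)
                          for t in trees])
        prev = np.inf
        for _ in range(max_cycles):
            for i in range(len(trees)):
                residual = y - (preds.sum(axis=0) - preds[i])
                trees[i] = DecisionTreeRegressor(
                    min_samples_split=min_split).fit(X, residual)
                preds[i] = trees[i].predict(X)
            rmse = np.sqrt(np.mean((y - preds.sum(axis=0)) ** 2))
            if prev - rmse < tol:
                break
            prev = rmse
        return trees

    def train_grove_classical(X, y, alpha, n_trees):
        # Start from N zero trees, then let backfitting converge.
        return converge([None] * n_trees, X, y, alpha)

The prediction of the resulting Grove is simply the sum of the predictions of its individual trees.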

Bagging. As with single decision trees, a single Grove tends to overfit to the training set when the trees are large. Such models show a large variance with respect to specific subsamples of the training data and benefit significantly from bagging, a well-known procedure for improving model performance by reducing variance [2]. On each iteration of bagging, we draw a bootstrap sample (bag) from the training set and train the full model (in our case a Grove of additive trees) from that sample. After repeating this procedure a number of times, we end up with an ensemble of models. The final prediction of the ensemble on each test data point is the average of the predictions of all models.

Example. In this section we illustrate the effects of different methods of training bagged Groves on synthetic data. The synthetic data set was generated by a function of 10 variables that was previously used by Hooker [3]:

    F(x) = π^(x1·x2) √(2·x3) − sin⁻¹(x4) + log(x3 + x5) − (x9/x10)·√(x7/x8) − x2·x7    (1)

Variables x1, x2, x3, x6, x7, x9 are uniformly distributed between 0.0 and 1.0, and variables x4, x5, x8, and x10 are uniformly distributed between 0.6 and 1.0. (The ranges are selected to avoid extremely large or small function values.)

Figure 1 shows a contour plot of how model performance depends on both α, the size of the trees, and N, the number of trees in a Grove, for 100 bagged Groves trained with the classical method on 1000 training points from the above data set. The performance is measured as RMSE on an independent test set consisting of 25,000 points. Notice that lower RMSE implies better performance. The bottommost horizontal line for N = 1 corresponds to bagging single trees. The plot clearly indicates that by introducing additive model structure, with N > 1, performance improves significantly. We can also see that the best performance is achieved by Groves containing 5-10 relatively small trees (large α), while for larger trees performance deteriorates.

2.2 Layered Training

When individual trees in a Grove are large and complex, the Classical additive model training algorithm (Section 2.1) can overfit even if bagging is applied.
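For reference, the synthetic data set of Equation (1) is easy to reproduce. The sketch below follows the equation and the stated variable ranges; the helper names are ours.

    import numpy as np

    def hooker_response(x):
        # Response function of Equation (1); x has 10 columns, x6 is unused.
        x1, x2, x3, x4, x5, x6, x7, x8, x9, x10 = x.T
        return (np.pi ** (x1 * x2) * np.sqrt(2.0 * x3)
                - np.arcsin(x4)
                + np.log(x3 + x5)
                - (x9 / x10) * np.sqrt(x7 / x8)
                - x2 * x7)

    def sample_hooker(n, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        x = np.empty((n, 10))
        # x1, x2, x3, x6, x7, x9 ~ U[0.0, 1.0]
        x[:, [0, 1, 2, 5, 6, 8]] = rng.uniform(0.0, 1.0, size=(n, 6))
        # x4, x5, x8, x10 ~ U[0.6, 1.0]  (avoids extreme function values)
        x[:, [3, 4, 7, 9]] = rng.uniform(0.6, 1.0, size=(n, 4))
        return x, hooker_response(x)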

Consider the extreme case α = 0, i.e., a Grove of full trees. The first tree will perfectly model the training data, leaving residuals with value 0 for the other trees in the Grove. Hence the intended Grove of several large trees will degenerate to a single tree. One could address this issue by limiting trees to very small size. However, we still would like to be able to use large trees in a Grove so that we can capture complex and non-linear functions.

To prevent the degeneration of the Grove as the trees become larger, we developed a layered training approach. In the first round we grow N small trees. Then, in later cycles of discarding and re-training the trees in the Grove, we gradually increase tree size. More precisely, no matter what the value of α, we always start the training process with small trees, typically using a start value α_0 = 0.5. Let α_j denote the value of the size parameter after j iterations of the Layered algorithm (Algorithm 2). After reaching convergence for α_{j−1}, we increase tree complexity by setting α_j to approximately half the value of α_{j−1}. We continue to cycle through the trees, re-training all trees in the Grove in the usual way, but now allow them to reach the larger size corresponding to the new α_j, and, as before, we proceed until the Grove converges on this layer. We keep gradually increasing tree size until α_j ≤ α. For a training set with 1000 data points and α = 0, we use the following sequence of values of α_j: (0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001).

Algorithm 2. Layered training

    function Layered(α, N, train)
        α_0 = 0.5, α_1 = 0.2, α_2 = 0.1, ..., α_max = α
        for j = 0 to max do
            if j = 0 then
                for i = 1 to N do
                    Tree_i^(α_0,N) = 0
            else
                for i = 1 to N do
                    Tree_i^(α_j,N) = Tree_i^(α_{j−1},N)
            Converge(α_j, N, train, Tree_1^(α_j,N), ..., Tree_N^(α_j,N))

It is worth noting that while training a Grove of large trees, we automatically obtain all Groves with the same N for all smaller tree sizes in the sequence.

Figure 2 shows how 100 bagged Groves trained by the layered approach perform on the synthetic data set. Overall performance is much better than for the classical algorithm, and bagged Groves of N large trees now perform at least as well as bagged Groves of N smaller trees.
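A compact sketch of this schedule, reusing the converge routine from the sketch in Section 2.1 (the driver itself is ours; the α values are those listed above):

    def train_grove_layered(X, y, alpha, n_trees,
                            schedule=(0.5, 0.2, 0.1, 0.05, 0.02, 0.01,
                                      0.005, 0.002, 0.001)):
        # Warm start: the same trees are re-converged at each smaller
        # alpha_j, so small trees are grown before large ones are allowed.
        trees = [None] * n_trees
        for a in [s for s in schedule if s > alpha] + [alpha]:
            trees = converge(trees, X, y, a)
        return trees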

2.3 Dynamic Programming Training

There is no reason to believe that the best (α, N) Grove should always be constructed from a (2α, N) Grove. In fact, a large number of small trees might overfit the training data and hence limit the benefit of increasing tree size in later iterations. To avoid this problem, we need to give the Grove training algorithm additional flexibility in choosing the right balance between increasing tree size and the number of trees. This is the motivation behind the Dynamic Programming Grove training algorithm. This algorithm can choose to construct a new Grove from an existing one either by adding a new tree (while keeping tree size constant) or by increasing tree size (while keeping the number of trees constant). Considering the parameter grid, the Grove for a grid point (α_j, n) can be constructed either from its left neighbor (α_{j−1}, n) or from its lower neighbor (α_j, n − 1). Pseudo-code for this approach is shown in Algorithm 3.

Algorithm 3. Dynamic Programming training

    function DP(α, N, trainSet)
        α_0 = 0.5, α_1 = 0.2, α_2 = 0.1, ..., α_max = α
        for j = 0 to max do
            for n = 1 to N do
                for i = 1 to n − 1 do
                    Tree_attempt1,i = Tree_i^(α_j,n−1)
                Tree_attempt1,n = 0
                Converge(α_j, n, train, Tree_attempt1,1, ..., Tree_attempt1,n)
                if j > 0 then
                    for i = 1 to n do
                        Tree_attempt2,i = Tree_i^(α_{j−1},n)
                    Converge(α_j, n, train, Tree_attempt2,1, ..., Tree_attempt2,n)
                winner = compare Tree_attempt1,* and Tree_attempt2,* on the validation set
                for i = 1 to n do
                    Tree_i^(α_j,n) = Tree_winner,i

We make the choice between the two options for computing each Grove (adding another tree or making the trees larger) in a greedy manner, i.e., we pick the one that results in better performance of the Grove on the validation set. We use the out-of-bag data points [4] as the validation set for choosing which of the two Groves to use at each step.

Figure 3 shows how the Dynamic Programming approach improves bagged Groves over the layered training. Figure 4 shows the choices that are made during the process: it plots the average difference between the RMSE of the Grove created from the lower neighbor (increase n) and the RMSE of the Grove created from the left neighbor (decrease α_j). A negative value means that the former is preferred, while a positive value means that the latter is preferred at that grid point. We can see that for this data set increasing the tree size is the preferred direction, except for cases with many small trees.

This dynamic programming version of the algorithm does not explore all possible sequences of steps to build a Grove of trees, because we require that every Grove built in the process contain trees of equal size. We have tested several other possible approaches that do not have this restriction, but they failed to produce any improvements and were noticeably worse in terms of running time. For these reasons we prefer the dynamic programming version over the other, less restricted options.

2.4 Randomized Dynamic Programming Training

Our bagged Grove training algorithms so far performed bagging in the usual way, i.e., create a bag of data, train all Groves for different values of (α, N) on that bag, then create the next bag, generate all models on this bag, and so on for 100 different bags. When the Dynamic Programming algorithm generates a Grove using the same bag, i.e., the same training set that was used to generate its left and lower neighbors, complex models might not be very different from their neighbors, because those neighbors already might have overfitted and there is not enough training data to learn anything new. We can address this problem by using a different bag of data on every step of the Dynamic Programming algorithm, so that every Grove has some new data to learn from. While the performance of a single Grove might become worse, the performance of bagged Groves improves due to increased variability in the models. Figure 5 shows the improved performance of this final version of our Grove training approach. The most complex Groves are now performing worse than their left neighbors with smaller trees. This happens because those models need more bagging steps to converge to their best quality. Figure 6 shows the same plot for bagging with 500 iterations, where the property "more complex models are at least as good as their less complex counterparts" is restored.
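The grid recurrence can be sketched as follows, again on top of the converge routine from Section 2.1. Here rmse_oob is a caller-supplied function estimating RMSE on the out-of-bag points; it and the dictionary bookkeeping are our placeholders, not part of the paper's pseudocode. For the randomized variant, one would additionally draw a fresh bootstrap sample of the training data before each converge call.

    import copy

    def train_grove_dp(X, y, alphas, N, rmse_oob):
        grid = {}  # (j, n) -> list of converged trees
        for j, a in enumerate(alphas):  # alphas ordered from large to small
            for n in range(1, N + 1):
                # attempt 1: lower neighbor (alpha_j, n-1) plus one zero tree
                candidates = [copy.deepcopy(grid.get((j, n - 1), [])) + [None]]
                if j > 0:
                    # attempt 2: left neighbor (alpha_{j-1}, n), trees enlarged
                    candidates.append(copy.deepcopy(grid[(j - 1, n)]))
                for trees in candidates:
                    converge(trees, X, y, a)
                # greedy choice on the out-of-bag validation points
                grid[(j, n)] = min(candidates, key=rmse_oob)
        return grid[(len(alphas) - 1, N)]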

[Figures 1-6 are contour plots over the parameter grid: the x-axis is α (size of leaf), the y-axis is the number of trees in a Grove.]

Fig. 1. RMSE of bagged Grove, Classical algorithm.
Fig. 2. RMSE of bagged Grove, Layered algorithm.
Fig. 3. RMSE of bagged Grove, Dynamic Programming algorithm.
Fig. 4. Difference in performance between horizontal and vertical steps.
Fig. 5. RMSE of bagged Grove (100 bags), Randomized Dynamic Programming algorithm.
Fig. 6. RMSE of bagged Grove (500 bags), Randomized Dynamic Programming algorithm.

3 Experiments

We evaluated bagged Groves of trees on 2 synthetic and 5 real-world data sets and compared their performance to that of two other regression tree ensemble methods that are known to perform well: stochastic gradient boosting and bagged regression trees. Bagged Groves consistently outperform both of them.

For the real data sets we performed 10-fold cross validation: for each run 8 folds were used as the training set, 1 fold as the validation set for choosing the best set of parameters, and the last fold as the test set for measuring performance. For the two synthetic data sets we generated 30 blocks of data containing 1000 points each and performed 10 runs using different blocks for the training, validation, and test sets. We report the mean and standard deviation of the RMSE on the test set. Table 1 shows the results; for comparability across data sets, all numbers are scaled by the standard deviation of the response in the data set itself.

Table 1. Performance of bagged Groves (Randomized Dynamic Programming training) compared to boosting and bagging: mean and standard deviation of the scaled RMSE on the test set over 10 runs, for bagged Groves, boosting, and bagged trees, on California Housing, Elevators, Kinematics, Computer Activity, Stock, Synthetic (no noise), and Synthetic (noise). [The numeric entries of this table did not survive transcription.]
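The scaling used in Table 1 is straightforward to reproduce (a sketch; the function name is ours):

    import numpy as np

    def scaled_rmse(y_true, y_pred, y_all):
        # RMSE divided by the standard deviation of the response in the
        # data set, so scores are comparable across data sets; a constant
        # predictor at the mean would score about 1.0.
        return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.std(y_all)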

3.1 Parameter Settings

Groves. We trained 100 bagged Groves using the Randomized Dynamic Programming technique for all combinations of the parameters N and α with 1 ≤ N ≤ 15 and α ∈ {0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005}. Notice that with these settings the resulting ensemble can consist of at most 1500 trees. From these models we selected the one that gave the best results on the validation set. The performance of the selected Grove on the test set is reported.

Stochastic Gradient Boosting. The obvious main competitor to bagged Groves is gradient boosting [5,6], a different ensemble of trees that is also based on additive models. There are two major differences between boosting and Groves. First, boosting never discards trees, i.e., every generated tree stays in the model, whereas a Grove iteratively retrains its trees. Second, all trees in a boosting ensemble are always built to a fixed size, while Groves of large trees are trained first using Groves of smaller trees. We believe that these differences allow Groves to better capture the natural additive structure of the response function.

The general gradient boosting framework supports optimizing a variety of loss functions. We selected squared-error loss because this is the loss function that the current version of the Groves algorithm optimizes. However, like gradient boosting, Groves can be modified to optimize other loss functions.

Friedman [6] recommends boosting small trees with at most 4-10 leaf nodes for best results. However, we discovered for one of our data sets that using larger trees with gradient boosting did significantly better. This is not surprising, since some real data sets contain complex interactions, which cannot be accurately modeled by small trees. For fairness we therefore also include larger boosted trees in the comparison than Friedman suggested. More precisely, we tried all α ∈ {1, 0.5, 0.2, 0.1, 0.05}. Fig. 7 shows the typical correspondence between α and the number of leaf nodes in a tree, which was very similar across the data sets. Preliminary results did not show any improvement for tree sizes beyond α = 0.05.

Stochastic gradient boosting deals with overfitting by means of two techniques: regularization and subsampling. Both techniques depend on user-set parameters. Based on recommendations in the literature and on our own evaluation, we used the following values for the final evaluation: 0.1 and 0.05 for the regularization coefficient, and 0.4, 0.6, and 0.8 for the fraction of the training set used as the subsampling set.
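With scikit-learn, a grid of this shape could look as follows. This is a sketch under our own assumptions: the paper controls tree size through α, while GradientBoostingRegressor caps it with max_leaf_nodes, so the leaf counts below are illustrative stand-ins rather than the paper's grid.

    from itertools import product
    from sklearn.ensemble import GradientBoostingRegressor

    models = [
        GradientBoostingRegressor(loss="squared_error", n_estimators=1500,
                                  max_leaf_nodes=leaves, learning_rate=lr,
                                  subsample=sub)
        for leaves, lr, sub in product([2, 4, 8, 16, 32],  # tree size
                                       [0.1, 0.05],        # regularization
                                       [0.4, 0.6, 0.8])    # subsampling
    ]
    # After fitting, staged_predict on the validation set gives predictions
    # after every boosting iteration, so the best iteration count (<= 1500)
    # can be selected together with the other three parameters.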

Fig. 7. Typical number of leaf nodes for different values of α: α = 1 yields a stump with 2 leaf nodes, and α = 0 yields a full tree. [The intermediate rows of this figure did not survive transcription.]

Fig. 8. Performance of bagged Grove for simpler and more complex models: test RMSE as a function of the number of bagging iterations, for a simpler Grove (N = 5, larger α) and a more complex Grove (N = 10, α = 0).

Boosting can also overfit if it is run for too many iterations. We tried up to 1500 iterations to make the maximum number of trees in the ensemble equal for all methods in the comparison. The actual number of iterations that performs best was determined based on the validation set, and therefore can be lower than 1500 for the best boosted model.

In summary, to evaluate stochastic gradient boosting, we tried all combinations of the values described above for the 4 parameters: size of trees, number of iterations, regularization coefficient, and subsampling size. As for Groves, we determined the best combination of values for these parameters based on a separate validation set.

Bagging. Bagging single trees is known to provide good performance by significantly decreasing the variance of the individual tree models. However, compared with Groves and boosting, which are both based on additive models, bagged trees do not explicitly model the additive structure of the response function. Increasing the number of iterations in bagging does not result in overfitting, and bagging larger trees usually produces better models than bagging smaller trees. Hence we omitted parameter tuning for bagging. Instead we simply report results for a model consisting of 1500 bagged full trees.

3.2 Datasets

Synthetic Data without Noise. This is the same data set that we used as a running example in the earlier sections. The response function is generated by Equation (1). The performance of bagged Groves on this data set is much better than the performance of the other methods.

Synthetic Data with Noise. This is the same synthetic data set, only this time Gaussian noise is added to the response function. The standard deviation σ of the noise distribution is chosen as 1/2 of the standard deviation of the response in the original data set. As expected, the performance of all methods drops. Bagged Groves still perform clearly better, but the difference is smaller.
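Reusing the generator sketched in Section 2.1, the noisy variant is a one-line change (sample_hooker is our hypothetical helper from that sketch):

    import numpy as np

    rng = np.random.default_rng(1)
    X, y = sample_hooker(1000, rng)   # clean synthetic response
    sigma = 0.5 * np.std(y)           # noise std = 1/2 of response std
    y_noisy = y + rng.normal(0.0, sigma, size=y.shape)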

We used 5 regression data sets from the collection of Luís Torgo [7] for the next set of experiments.

Kinematics. The Kinematics family of data sets originates from the Delve repository [8] and describes a simulation of robot arm movement. We used the kin8nm version of the data set: 8192 cases, 8 continuous attributes, high level of non-linearity, low level of noise. Groves show a 20% improvement over gradient boosting on this data set. It is worth noticing that boosting preferred large trees on this data set: trees with α = 0.05 showed a clear advantage over smaller trees. However, there was no further improvement for boosting even larger trees. We attribute these effects to the high non-linearity of the data.

Computer Activity. Another data set from the Delve repository; it describes the state of multiuser computer systems: 8192 cases, 22 continuous attributes. The variance of performance for all algorithms is low. Groves show a small (3%) improvement compared to boosting.

California Housing. This is a data set from the StatLib repository [9] that describes housing prices in California from the 1990 Census: 20,640 observations, 9 continuous attributes. Groves show a 6% improvement compared to boosting.

Stock. This is a relatively small (960 data points) regression data set from the StatLib repository. It describes daily stock prices for 10 aerospace companies: the task is to predict the first one from the other 9. Prediction quality from all methods is very high, so we can assume that the level of noise is small. This is another case where Groves give a significant improvement (18%) over gradient boosting.

Elevators. This data set is obtained from the task of controlling an aircraft [10]. It seems to be noisy, because the variance of performance is high although the data set is rather large: 16,559 cases with 18 continuous attributes. Here we see a 6% improvement.

3.3 Discussion

Based on the empirical results we conjecture that bagged Groves outperform the other algorithms most when the data sets are highly non-linear and not very noisy. (Noise can obscure some of the non-linearity in the response function, making the best models that can be learned from the data more linear than they would have been for models trained on the response without noise.) This can be explained as follows. Groves can capture additive structure yet at the same time use large trees. Large trees capture non-linearity and complex interactions well, and this gives Groves an advantage over gradient boosting, which relies mostly on additivity. Gradient boosting usually works best with small trees and fails to make effective use of large trees. At the same time most data sets, even non-linear ones, still have significant additive structure. The ability to detect and model this additivity gives Groves an advantage over bagging, which is effective with large trees but does not explicitly model additive structure.

Gradient boosting is a state-of-the-art ensemble tree method for regression. Chipman et al. [11] recently performed an extensive comparison of several algorithms on 42 data sets. In their experiments gradient boosting showed performance similar to or better than Random Forests and a number of other types of models. Our algorithm shows performance consistently better than gradient boosting, and for this reason we do not expect that Random Forests or other methods that are not superior to gradient boosting would outperform our bagged Groves.

In terms of computational cost, bagged Groves and boosting are comparable. In both cases a large number of tree models has to be trained (more for Groves) and there is a variety of parameter combinations that need to be examined (more for boosting).

4 Bagging Iterations and Overfitting Resistance

In our experiments we used a fixed number of bagging iterations and did not consider this a tuning parameter, because bagging rarely overfits. In bagging the number of iterations is not as crucial as it is for boosting: if we bag as long as we can afford, we will get the best value that we can achieve. In that sense the experimental results we report are conservative, and bagged Groves could potentially be improved by additional bagging iterations.

We observed a similar trend for the parameters α and N as well: more complex models (larger trees, more trees) are at least as good as their less complex counterparts, but only if they are bagged sufficiently many times. Figure 8 shows how the performance on the synthetic data set with noise depends on the number of bagging iterations for two bagged Groves. The simpler one is trained with N = 5 and a larger α, and the more complex one is trained with N = 10 and α = 0. We can see that eventually they converge to the same performance, and that the simpler model does better than the complex model only when the number of bagging iterations is small. (Note that this is only true because of the layered approach to training Groves, which trains Groves of trees of smaller size before moving on to Groves with larger trees. If one initialized a Grove with a single large tree, the performance of bagged Groves might still decrease with increasing tree size, because the ability of the Grove to learn the additive structure of the problem would be injured.) We observed similar behavior for the other data sets.

This suggests that one way to get good performance with bagged Groves might be to build the most complex Groves (large trees, many trees) that can be afforded and bag them many, many times until performance tops out. In this case we might not need a validation set to select the best parameter settings. However, in practice the most complex models can require many more iterations of bagging than simpler models that achieve almost the same level of performance much faster. Hence the approach used in our experiments can be more useful in practice: select a computationally acceptable number of bagging iterations (100 seems to work fine, but one could also use 200 or 500 to be more confident) and search for the best N and α for this number of bagging iterations on the validation set.
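Curves like those in Figure 8 come from evaluating the running average of the per-bag predictions; a small sketch (the array layout is our assumption):

    import numpy as np

    def rmse_vs_bags(grove_preds, y_test):
        # grove_preds: shape (n_bags, n_test), one row of test predictions
        # per Grove trained on its own bootstrap sample.  Returns the test
        # RMSE of the ensemble truncated to the first k bags, k = 1..n_bags.
        k = np.arange(1, len(grove_preds) + 1)[:, None]
        running_mean = np.cumsum(grove_preds, axis=0) / k
        return np.sqrt(np.mean((running_mean - y_test) ** 2, axis=1))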

5 Conclusion

We presented a new regression algorithm, bagged Groves of trees, which is an additive ensemble of regression trees. It combines the benefits of large trees that model complex interactions with the benefits of capturing additive structure by means of additive models. Because of this, bagged Groves perform especially well on complex non-linear data sets where the structure of the response function contains both additive structure (which is best modeled by additive trees) and variable interactions (which are best modeled within a tree). We have shown that on such data sets bagged Groves outperform state-of-the-art techniques such as stochastic gradient boosting and bagging. Thanks to bagging, and the layered way in which Groves are trained, bagged Groves resist overfitting: more complex Groves tend to achieve the same or better performance as simpler Groves.

Groves are good at capturing the additive structure of the response function. A future direction of our work is to develop techniques for determining properties inherent in the data using this algorithm. In particular, we believe we can use Groves to learn useful information about statistical interactions between variables in the data set.

References

1. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer, Heidelberg (2001)
2. Breiman, L.: Bagging Predictors. Machine Learning 24, 123-140 (1996)
3. Hooker, G.: Discovering Additive Structure in Black Box Functions. In: Proc. ACM SIGKDD, ACM Press, New York (2004)
4. Bylander, T.: Estimating Generalization Error on Two-Class Datasets Using Out-of-Bag Estimates. Machine Learning 48(1-3), 287-297 (2002)
5. Friedman, J.: Greedy Function Approximation: a Gradient Boosting Machine. Annals of Statistics 29, 1189-1232 (2001)
6. Friedman, J.: Stochastic Gradient Boosting. Computational Statistics and Data Analysis 38, 367-378 (2002)
7. Torgo, L.: Regression DataSets, ltorgo/regression/DataSets.html
8. Rasmussen, C.E., Neal, R.M., Hinton, G., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., Tibshirani, R.: Delve. University of Toronto, delve
9. Meyer, M., Vlachos, P.: StatLib. Department of Statistics at Carnegie Mellon University
10. Camacho, R.: Inducing Models of Human Control Skills. In: Nédellec, C., Rouveirol, C. (eds.) Machine Learning: ECML-98. LNCS, vol. 1398, Springer, Heidelberg (1998)
11. Chipman, H., George, E., McCulloch, R.: Bayesian Ensemble Learning. In: Advances in Neural Information Processing Systems 19 (2007)
