Predicting the Density of Algae Communities using Local Regression Trees

Luís Torgo
LIACC Machine Learning Group / FEP, University of Porto
R. Campo Alegre, 823
email: ltorgo@ncc.up.pt

ABSTRACT: This paper describes the application of local regression trees to an environmental regression task. This task was part of the 3rd International ERUDIT Competition. The technique described in this paper was announced as one of the three runner-up methods by the jury of the competition. We briefly describe RT, a system that is able to generate local regression trees. We emphasise the multi-strategy features of RT, which we claim to be one of the causes of the obtained performance. We describe the pre-processing steps that were taken in order to apply RT to the competition data, highlighting the weighting schema that was used to combine the predictions of the best RT variants. We conclude by reinforcing the idea that the combination of features of different data analysis techniques can be useful to obtain higher predictive accuracy.

KEYWORDS: Regression trees, local modelling, hybrid methods, local regression trees.

INTRODUCTION

This paper describes an application of local regression trees (Torgo, 1997a, 1997b, 1999) to an environmental data analysis task. This application was carried out in the context of the 3rd International Competition organised by ERUDIT in conjunction with the new Computational Intelligence and Learning Cluster. This cluster is a cooperation between four EC-funded Networks of Excellence: ERUDIT, EvoNet, MLnet and NEuroNet. This paper describes the use of the system RT (Torgo, 1999) in the context of this competition. RT is a computer program that is able to obtain local regression trees. In this paper we briefly describe the main ideas behind local regression trees, and then focus on the steps taken to obtain a solution to this data analysis task.

The data analysis task concerns the environmental problem of determining the state of rivers and streams by monitoring and analysing certain measurable chemical concentrations, with the goal of inferring the biological state of the river, namely the density of algae communities. This study is motivated by the increasing concern with the impact human activities are having on the environment. Identifying the key chemical control variables that influence the biological processes associated with these algae has become a crucial task in order to reduce the impact of human activities. The data used in this 3rd ERUDIT international competition comes from such a study. Water quality samples were collected from various European rivers during one year. These samples were analysed for various chemical substances including: nitrogen in the form of nitrates, nitrites and ammonia, phosphate, pH, oxygen and chloride. At the same time, algae samples were collected to determine the distributions of the algae populations. The dynamics of algae communities is strongly influenced by the external chemical environment. Determining which chemical factors most influence these dynamics is important knowledge that can be used to control these populations. At the same time, there is an economic factor further motivating this analysis. In effect, the chemical analysis is cheap and easily automated. On the contrary, the biological part involves microscopic examination, requires trained manpower and is therefore both expensive and slow. The competition task consisted of predicting the frequency distribution of seven different algae on the basis of eight measured concentrations of chemical substances plus some additional information characterising the environment from which the sample was taken (season, river size and flow velocity). RT is a regression analysis tool that can be seen as a kind of multi-strategy data analysis system. We will see that this characteristic is a consequence of the theory behind local regression trees. In effect, these models integrate regression trees (e.g. Breiman et al., 1984) with local modelling (e.g. Cleveland and Loader, 1995). The integration schema used in

RT allows several variants to be tried out on a given data set. Due to this flexibility, RT can cope with problems having different characteristics, thanks to its ability to emulate techniques with different approximation biases. In this competition we have taken advantage of this facet of RT. We have treated the prediction of each of the seven algae frequencies as a different regression task. For each of the seven regression problems we have carried out a selection process with the aim of estimating which RT variant provided the best accuracy. The following section provides a brief description of local regression trees. We describe the main components integrating these hybrid regression models and the method that was used to integrate them within RT. We then focus on the technical details concerning the application of RT to the task of predicting the density of algae communities.

LOCAL REGRESSION TREES

Local Regression Trees (Torgo, 1997a, 1997b, 1999) explore the possibility of improving the accuracy of regression trees by using smoother models at the tree leaves. These models can be regarded as a hybrid approach to multivariate regression, integrating different solutions to this data analysis problem. Local regression (e.g. Cleveland and Loader, 1995) is a non-parametric statistical methodology that provides smooth modelling by not assuming any particular global form of the unknown regression function. On the contrary, these models fit a functional form within the neighbourhood of the query points. These models are known to provide highly accurate predictions over a wide range of problems due to the absence of a pre-defined functional form. However, local regression techniques are also known for their computational cost, low interpretability and storage requirements. Regression trees (e.g. Breiman et al., 1984), on the other hand, obtain models through a recursive partitioning algorithm that ensures high computational efficiency. Moreover, the resulting models are usually considered highly interpretable.

By integrating regression trees with local modelling, we not only improve the accuracy of the trees, but also increase the computational efficiency and comprehensibility of local models. In this section we provide a brief description of local regression trees. We start by describing both standard regression trees and local modelling. We then address the issue of how these two methodologies are integrated in our RT regression tool.

STANDARD REGRESSION TREES

A regression tree can be seen as a kind of additive model (Hastie & Tibshirani, 1990) of the form

    m(x) = Σ_{i=1..l} k_i × I(x ∈ D_i)    (1)

where the k_i are constants; I(.) is an indicator function returning 1 if its argument is true and 0 otherwise; and the D_i are disjoint partitions of the training data D such that ∪_{i=1..l} D_i = D and ∩_{i=1..l} D_i = ∅.

Models of this type are sometimes called piecewise constant regression models, as they partition the predictor space¹ χ into a set of mutually exclusive regions and fit a constant value within each region. An important aspect of tree-based regression models is that they provide a propositional logic representation of these regions in the form of a tree. Each path from the root of the tree to a leaf corresponds to such a region. Each inner node² of the tree is a logical test on a predictor variable³. In the particular case of binary trees there are two possible outcomes of the test, true or false. This means that associated with each partition D_i we have a path P_i consisting of a conjunction of logical tests on the predictor variables. This symbolic representation of the regression surface is an important issue when one wants to have a better understanding of the problem under consideration. For instance, consider the following small example of a regression tree obtained by applying RT to the competition data to obtain a model for one of the algae:

    P1: Ch7 ≤ 43.8                               → k1 = 39.8 (± 5.9)
    P2: Ch7 > 43.8 ∧ Ch3 ≤ 7.16                  → k2 = 50.1 (± 12.2)
    P3: Ch7 > 43.8 ∧ Ch3 > 7.16 ∧ Ch6 ≤ 51.1     → k3 = 12.9
    P4: Ch7 > 43.8 ∧ Ch3 > 7.16 ∧ Ch6 > 51.1     → k4 = 3.85 (± 1.4)

    Figure 1: An Example of a Regression Tree (shown here as its four root-to-leaf paths).

As there are four distinct paths from the root node to the leaves, this tree divides the input space into four different regions. The conjunction of the tests in each path can be regarded as a logical description of such regions, as shown above. This tree can be used to make predictions of the density of this alga for future water samples. Moreover, it provides information on which variables most influence this density.

Regression trees are constructed using a recursive partitioning (RP) algorithm. This algorithm builds a tree by recursively splitting the training sample into smaller subsets. We give below a high-level description of the algorithm. The RP algorithm receives as input a set of n data points, D_t = {⟨x_i, y_i⟩}_{i=1..n}, and if certain termination criteria are not met it generates a test node t, whose branches are obtained by applying the same algorithm to two subsets of the input data points. These subsets consist of the cases that logically entail the test s* in the node, D_tL = {⟨x_i, y_i⟩ ∈ D_t : x_i → s*}, and the remaining cases, D_tR = {⟨x_i, y_i⟩ ∈ D_t : x_i ↛ s*}. At each node the best split test is chosen according to some local criterion, which means that this is a greedy hill-climbing algorithm.

The Recursive Partitioning Algorithm.

    Input:  A set of n data points, {⟨x_i, y_i⟩}, i = 1, ..., n
    Output: A regression tree
    IF termination criterion THEN
        Create Leaf Node and assign it a Constant Value
        Return Leaf Node
    ELSE
        Find Best Splitting Test s*
        Create Node t with s*
        Left_branch(t)  = RecursivePartitioningAlgorithm({⟨x_i, y_i⟩ : x_i → s*})
        Right_branch(t) = RecursivePartitioningAlgorithm({⟨x_i, y_i⟩ : x_i ↛ s*})
        Return Node t
    ENDIF

The algorithm has three main components:
- A way to select a split test (the splitting rule).
- A rule to determine when a tree node is terminal (termination criterion).
- A rule for assigning a value to each terminal node.

¹ The multidimensional space formed by all input (or predictor) variables of a multivariate regression problem.
² All nodes except the leaves.
³ Although work exists on multivariate tests (e.g. Breiman et al., 1984; Murthy et al., 1994; Brodley & Utgoff, 1995; Gama, 1997).
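As an illustration of the pseudocode above, the sketch below builds a piecewise constant tree for a single numeric predictor under the MSE criterion: leaves predict the mean of their cases, and the best split is found by exhaustive search over cut-points. This is a didactic sketch, not the RT implementation; the `min_leaf` termination threshold is our own arbitrary choice.

```python
# Minimal recursive-partitioning regression tree (illustrative sketch).
def build_tree(data, min_leaf=3):
    # data: list of (x, y) pairs, x a single numeric predictor.
    ys = [y for _, y in data]
    mean = sum(ys) / len(ys)
    # Termination criterion: too few cases to split further.
    if len(data) < 2 * min_leaf:
        return {"leaf": True, "value": mean}
    best = None  # (sse, threshold)
    for cut in sorted(set(x for x, _ in data))[1:]:
        left = [y for x, y in data if x < cut]
        right = [y for x, y in data if x >= cut]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, cut)
    if best is None:
        return {"leaf": True, "value": mean}
    cut = best[1]
    return {"leaf": False, "cut": cut,
            "left": build_tree([(x, y) for x, y in data if x < cut], min_leaf),
            "right": build_tree([(x, y) for x, y in data if x >= cut], min_leaf)}

def predict(tree, x):
    # Follow the root-to-leaf path selected by the tests.
    while not tree["leaf"]:
        tree = tree["left"] if x < tree["cut"] else tree["right"]
    return tree["value"]

data = [(v, 10.0 if v < 5 else 40.0) for v in range(10)]
tree = build_tree(data)
print(predict(tree, 2.0), predict(tree, 8.0))  # 10.0 40.0
```

On this toy data the first split lands between the two constant regimes, so the tree reproduces the piecewise constant surface exactly.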

The answer to these three problems is related to the error criterion that is selected to guide the growth of the tree. The most common choice is the minimisation of the mean squared error (MSE). Using this criterion, the constant that should be used in the leaves of the tree (terminal nodes) is the average goal variable value of the cases that fall in each leaf. Moreover, this criterion leads to the following rule for determining the best split of each test node (Torgo, 1999): the best split s* is the split that maximises the expression

    S_L² / n_tL + S_R² / n_tR    (2)

where S_L = Σ_{⟨x_i, y_i⟩ ∈ D_tL} y_i and S_R = Σ_{⟨x_i, y_i⟩ ∈ D_tR} y_i.

This rule leads to fast incremental updating algorithms that allow efficient search for the best test of each node (Torgo, 1999). Finally, the termination criterion is related to the issue of reliable error estimation. Methodologies like regression trees that have a rich hypothesis (model) language incur the danger of overfitting the training data. This in turn leads to unnecessarily large models that usually have poor generalisation performance (i.e. lower predictive accuracy on new samples of the same problem). Overfitting avoidance in regression trees is usually achieved through a process of pruning unreliable branches of an overly large tree (e.g. Breiman et al., 1984). Many pruning techniques exist (see Torgo, 1999 for an overview), most of them relying on reliable estimation of the predictive error of the trees (Torgo, 1998). Regression trees achieve competitive predictive accuracy in a wide range of applications. Moreover, their computational efficiency allows dealing with problems with hundreds of variables and hundreds of thousands of cases. Still, regression trees provide a highly non-smooth function approximation with strong discontinuities⁴.

LOCAL MODELLING

Local modelling⁵ (Fan, 1995) is a data analytic methodology whose basic idea consists of obtaining the prediction for a data point x by fitting a parametric function in the neighbourhood of x. This means that these methods are locally parametric, as opposed to, for instance, least squares linear regression.
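The incremental updating idea behind expression (2) can be sketched as follows: after sorting the cases by the candidate variable, moving one case at a time from the right to the left partition updates S_L, n_tL, S_R and n_tR in constant time, so every cut-point is scored in a single pass. This is our own illustration, not RT's actual code.

```python
# One-pass search for the cut maximising S_L^2/n_L + S_R^2/n_R (expression (2)).
def best_split(points):
    # points: list of (x, y) pairs; returns (best_score, threshold).
    pts = sorted(points)
    s_left, n_left = 0.0, 0
    s_right, n_right = sum(y for _, y in pts), len(pts)
    best_score, best_cut = float("-inf"), None
    for i in range(len(pts) - 1):
        x, y = pts[i]
        # Incremental update: shift one case from the right to the left partition.
        s_left += y; n_left += 1
        s_right -= y; n_right -= 1
        if x == pts[i + 1][0]:
            continue  # cannot cut between equal predictor values
        score = s_left ** 2 / n_left + s_right ** 2 / n_right
        if score > best_score:
            best_score = score
            best_cut = (x + pts[i + 1][0]) / 2
    return best_score, best_cut

pts = [(1, 10.0), (2, 10.0), (3, 10.0), (6, 40.0), (7, 40.0), (8, 40.0)]
print(best_split(pts))  # (5100.0, 4.5): the cut separating the two regimes
```

Each candidate cut is scored in O(1) after the sort, giving O(n log n) per variable instead of O(n²).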
Moreover, these methods do not produce a visible model of the data. Instead they make predictions based on local models generated on a query point basis. According to Cleveland and Loader (1995), local regression traces back to the 19th century; these authors provide a historical survey of the work done since then. The modern work on local modelling starts in the 1950s with the kernel methods introduced within the probability density estimation setting (Rosenblatt, 1956; Parzen, 1962) and within the regression setting (Nadaraya, 1964; Watson, 1964). Local polynomial regression (Stone, 1977; Cleveland, 1979; Katkovnik, 1979) is a generalisation of this early work on kernel regression. In effect, kernel regression amounts to fitting a polynomial of degree zero (a constant) in a neighbourhood. Summarising, we can state the general goal of local regression as trying to fit a polynomial of degree p around a query point (or test case) q using the training data in its neighbourhood. This includes the various available settings like kernel regression (p = 0), local linear regression (p = 1), etc.⁶ Most studies of local modelling address the univariate case. Still, the framework is applicable to the multivariate case and has been used with success in some domains (Atkeson et al., 1997; Moore et al., 1997). However, several authors warn of the danger of applying these methods in higher input space dimensions⁷ (Härdle, 1990; Hastie & Tibshirani, 1990). The problem is that with a high number of variables the training cases are so sparse that the notion of local neighbourhood can hardly be seen as local. Another drawback of local modelling is the complete lack of interpretability of the models. No interpretable model of the training data is obtained. With some simulation work one may obtain a graphical picture of the approximation provided by local regression models, but this is only possible with a low number of input variables.

⁴ This latter characteristic can be seen as advantageous or disadvantageous, depending on the application.
⁵ Also known as non-parametric smoothing and local regression.
⁶ A further generalisation of this set-up consists of using polynomial mixing (Cleveland & Loader, 1995), where p can take non-integer values.
⁷ The so-called curse of dimensionality (Bellman, 1961).
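To make the p = 0 versus p = 1 distinction concrete, the sketch below (ours, not taken from any of the cited systems) computes a Nadaraya-Watson kernel prediction (degree-zero fit) and a locally weighted linear prediction at a query point, both using a Gaussian weighting function with bandwidth h.

```python
import math

def gaussian_w(d, h):
    # Kernel weight for a case at distance d from the query, bandwidth h.
    return math.exp(-(d / h) ** 2)

def kernel_pred(q, data, h=1.0):
    # p = 0: prediction is the kernel-weighted average of the targets.
    w = [gaussian_w(abs(x - q), h) for x, _ in data]
    return sum(wi * y for wi, (_, y) in zip(w, data)) / sum(w)

def local_linear_pred(q, data, h=1.0):
    # p = 1: weighted least squares line fitted around q, evaluated at q.
    w = [gaussian_w(abs(x - q), h) for x, _ in data]
    sw = sum(w)
    mx = sum(wi * x for wi, (x, _) in zip(w, data)) / sw
    my = sum(wi * y for wi, (_, y) in zip(w, data)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, (x, _) in zip(w, data))
    sxy = sum(wi * (x - mx) * (y - my) for wi, (x, y) in zip(w, data))
    slope = sxy / sxx if sxx else 0.0
    return my + slope * (q - mx)

data = [(float(i), 2.0 * i + 1.0) for i in range(10)]  # exactly linear data
print(kernel_pred(4.5, data), local_linear_pred(4.5, data))
```

On exactly linear data the local linear fit recovers the true value (here 10.0 at q = 4.5); the degree-zero kernel fit only matches it because this query point sits symmetrically among the cases, and would be biased near the boundaries.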

In spite of being considered a non-parametric regression technique, local modelling does have several parameters that must be tuned in order to obtain good predictive results. One of the most important is the notion of neighbourhood. Given a query point q, we need to decide which training cases will be used to fit a local polynomial around it. This involves defining a distance metric over the multidimensional space defined by the input variables. With this metric we can specify a distance function that allows finding the nearest training cases of any query point. Still, many open issues remain. Namely, the weighting of the variables within the distance calculation can be crucial in domains with less relevant variables. Moreover, we need to specify how many training cases will enter the local fit (usually known as the bandwidth selection problem). Even after having a bandwidth size specification, we need to weigh the contributions of the training cases within the bandwidth: nearer points should contribute more to the local fit. This is usually accomplished through a weighting function that takes the distance to the query point into account (known as the kernel function). The correct tuning of all these modelling parameters can be crucial for a successful use of local modelling. Local modelling provides smooth function approximation with wide applicability through correct tuning of the distance function parameters. Still, it is a computationally intensive method that does not produce a comprehensible model of the data. Moreover, these methods have difficulties in dealing with domains with strong discontinuities in the function being approximated.

INTEGRATING REGRESSION TREES WITH LOCAL MODELLING

In this section we describe how we propose to overcome some of the limitations of both regression trees and local modelling through an integration of the two methods, resulting in what we refer to as local regression trees. The main goals of this integration can be stated as follows:
- Improve the smoothness of standard regression trees, leading to superior predictive accuracy.
- Improve local modelling in the following aspects: computational efficiency; generation of models that are comprehensible to human users; and capability to deal with domains with strong discontinuities.

In our study of local regression trees we have considered the integration of the following local modelling techniques:
- Kernel models (Watson, 1964; Nadaraya, 1964), which amount to fitting a polynomial of degree zero (i.e. a constant) in the local neighbourhood.
- Local linear polynomials (Stone, 1977; Cleveland, 1979; Katkovnik, 1979), where we fit a linear polynomial within the neighbourhood.
- Semi-parametric or partial linear models (Spiegelman, 1976; Härdle, 1990), consisting of a standard least squares linear model fitted on all data plus a kernel model component on the residuals of the linear polynomial, acting as a local correction of the polynomial's global linearity assumption.

The main decisions concerning the integration of local models with regression trees are how and when to perform it. Three main alternatives exist:
- Assume the use of local models in the leaves during all tree induction (i.e. growth and pruning phases).
- Grow a standard regression tree and use local models only during the pruning stage.
- Grow and prune a standard regression tree; use local models only in prediction tasks.

The first of these alternatives is the most consistent from a theoretical point of view, as the choice of the best split depends on the models at the leaves. This is the approach followed in RETIS (Karalic, 1992), which integrates global least squares linear polynomials in the tree leaves. However, obtaining such models is a computationally demanding task: for each trial split, the left and right child models need to be obtained and their error calculated. Even with a model as simple as the average, the evaluation of all candidate splits is too heavy without an efficient incremental algorithm. This means that if more complex models are to be used, like linear polynomials, this task is practically unfeasible for large domains⁸. The experiments described by Karalic (1992) used data sets with a few hundred cases. The author does not provide any results concerning the computational penalty of using linear models instead of averages. Still, we claim that

⁸ Particularly with a large number of continuous variables.

when using more complex models like kernel regression this approach is not feasible if a reasonable computation time is to be achieved. The second alternative is to introduce the more complex models only during the pruning stage. The tree is grown (i.e. the splits are chosen) assuming averages in the leaves; only during the pruning stage do we consider that the leaves will contain more complex models, which entails obtaining them for each node of the grown tree. Notice that this is much more efficient than the first alternative, which involved obtaining the models for each trial split considered during tree growth. This is the approach followed in M5 (Quinlan, 1992). This system also uses global least squares linear polynomials in the tree leaves, but these models are only added during the pruning stage. We have access to a version of M5⁹ and have confirmed that this is a computationally feasible solution even for large problems. In our integration of regression trees with local modelling we have followed the third alternative, in which the learning process is separated from prediction tasks. We generate the regression trees using the standard methodology described previously. When we want to use the learned tree to make predictions for a set of unseen test cases, we can choose which model should be used in the tree leaves, including complex models like kernels or local linear polynomials. Using this approach we only have to fit as many models as there are leaves in the final pruned tree. The main advantage of this approach is its computational efficiency, but it also allows trying several alternative models without having to re-learn the tree. The initial tree can thus be regarded as a rough approximation of the regression surface which is comprehensible to the human user. On top of this rough surface we may fit smoother models for each data partition generated in the leaves of the regression tree, so as to increase predictive accuracy.
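The third alternative can be sketched as follows: the tree (here a hypothetical one-split tree that stores its training cases in the leaves) routes the query to a leaf, and only then is a model, chosen at prediction time, fitted on that leaf's cases. Swapping `predict_kernel` for `predict_constant` changes the leaf model without re-learning the tree.

```python
import math

# Sketch of prediction-time local models in the leaves of a fixed tree.
# The tree below is hypothetical: one split at x < 5, with the training
# cases kept in each leaf instead of a single pre-computed constant.
tree = {
    "cut": 5.0,
    "left":  {"cases": [(1.0, 12.0), (2.0, 10.0), (4.0, 6.0)]},
    "right": {"cases": [(6.0, 40.0), (7.0, 42.0), (9.0, 44.0)]},
}

def leaf_for(tree, q):
    return tree["left"] if q < tree["cut"] else tree["right"]

def predict_constant(tree, q):
    # Standard regression tree: leaf average.
    ys = [y for _, y in leaf_for(tree, q)["cases"]]
    return sum(ys) / len(ys)

def predict_kernel(tree, q, h=2.0):
    # Local regression tree: kernel model fitted on the leaf's cases only.
    cases = leaf_for(tree, q)["cases"]
    w = [math.exp(-((x - q) / h) ** 2) for x, _ in cases]
    return sum(wi * y for wi, (_, y) in zip(w, cases)) / sum(w)

print(predict_constant(tree, 2.0))  # plain leaf average
print(predict_kernel(tree, 2.0))    # smoother, weighted toward nearby cases
```

Only as many local fits are needed as there are queried leaves, which is the computational advantage the text describes.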
In Torgo (1999) a large experimental evaluation of local regression trees over a wide range of problems was carried out. This large set of experimental comparisons confirmed that local regression trees are significantly more accurate than standard regression trees. However, these new regression models are computationally more demanding and less comprehensible than standard regression trees. We have also observed that local regression trees can overcome some of the limitations of local modelling techniques, particularly their lack of comprehensibility and rather high processing time: these experiments have shown that local regression trees are significantly faster than local modelling techniques, and through the integration within a tree-based structure we obtain a comprehensible insight into the approximation of these models. Finally, we have also observed that the modelling bias resulting from combining local models and partition-based approaches improves the accuracy of local models in domains where there are strong discontinuities in the regression surface.

PREDICTING THE DENSITY OF ALGAE COMMUNITIES

In this section we describe the steps followed in the application of RT in the context of the 3rd International ERUDIT Competition. As mentioned, this competition consisted of a data analysis task concerning the environmental problem of determining the state of rivers and streams by monitoring and analysing certain measurable chemical concentrations. The goal was to obtain a model allowing predictions of the frequency distribution of seven algae. With this purpose the organisers delivered two data files to each competitor. The first contained the training data, consisting of 200 samples described by 11 input variables plus the observed algae frequency distributions. This training file was presented as a matrix with 200 rows and 18 (11+7) columns. Of the 11 input variables (labelled A, ..., K), three were categorical, namely the season, the river size and the flow velocity. All remaining input variables were numeric. Several river samples contained unknown input variable values (signalled by XXXX). The competitors were also given a second file containing the test data, consisting of another 140 river samples. For these, only the values of the 11 input variables were given, and there were also unknown variable values among them. Thus the test data set corresponded to a matrix of 140 rows and 11 columns of values. The goal of the competition was to build a model based on the training data that could provide predictions of the frequency distributions of the 7 algae (labelled a, ..., g) for the 140 test samples, whose true values were only known to the organisers. The competitors were to present their results as a matrix of predictions, and the proposed solutions were evaluated by the sum of the total squared errors between their predictions and the true values.

⁹ Version 5.1
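The unknown values signalled by XXXX have to be dealt with before modelling. A minimal sketch of the kind of scheme we adopted (described in the next section: fill each unknown with the median of that variable among the most similar complete cases, similarity measured by Euclidean distance) could look as follows; this is an illustration on toy data, not the actual pre-processing code.

```python
import math
from statistics import median

def impute(rows, k=10):
    # rows: list of lists of floats, with None marking an unknown value.
    # Each unknown is replaced by the median of that column among the
    # k most similar complete rows (Euclidean distance on known columns).
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        def dist(c):
            return math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(r, c) if a is not None))
        nearest = sorted(complete, key=dist)[:k]
        filled.append([v if v is not None else median(c[j] for c in nearest)
                       for j, v in enumerate(r)])
    return filled

rows = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [10.0, 20.0], [1.0, None]]
print(impute(rows, k=3))  # last row's None becomes the neighbours' median
```

Using the median rather than the mean makes the filled value robust to an outlying neighbour such as the fourth row above.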

PROBLEM SOLUTION USING RT

This section briefly describes the steps taken to produce a matrix of predictions based on models obtained with RT. The first step was to divide the given training data into seven different training files, one for each of the 7 algae. This means that we dealt with this problem as seven different regression tasks. For each of these seven training sets all input variable values were the same; only the target variable changed. The second step of our proposed solution consisted of filling in the unknown variable values. For each training case with an unknown variable value we searched for the 10 most similar cases, similarity being assessed using a Euclidean distance metric. The median value of the variable in these 10 cases was used to fill in the unknown value. The methodology used in RT to integrate local models with regression trees enables this system to emulate a large number of regression techniques through simple parameter settings. In effect, as the predictions of the local models used in the leaves are obtained completely independently of the tree growth phase, we can very easily try different variants of local regression. Moreover, we can even use these models on all training data, which corresponds to the original local models. This latter alternative is easily accomplished by pruning the initial tree until a single leaf is reached. This means that RT can emulate several regression techniques: standard regression trees, local regression trees and local modelling techniques. Moreover, several of the implemented local modelling techniques can be used to emulate other regression models. For instance, a local linear polynomial can behave like a standard least squares linear regression model through appropriate parameter tuning¹⁰. In summary, RT's hybrid character can be used to obtain the following types of models:
- Standard least squares (LS) regression trees.
- Least absolute deviation (LAD) regression trees.
- Regression trees (LS or LAD) with least squares linear polynomials in the leaves.
- Regression trees (LS or LAD) with kernel models in the leaves.
- Regression trees (LS or LAD) with k nearest neighbours in the leaves.
- Regression trees (LS or LAD) with least squares local linear polynomials in the leaves.
- Regression trees (LS or LAD) with least squares partial linear polynomials (models that integrate both parametric and non-parametric components) in the leaves.
- Least squares linear polynomials (with or without backward elimination to simplify the models).
- Kernel models.
- K nearest neighbours.
- Local linear polynomials.
- Partial linear polynomials.

Having so many variants in RT, the next step of our analysis was to find out which were the most promising techniques for each of the seven regression tasks. With this purpose we carried out the following experiment. We selected a large set of candidate variants of RT and, for each of the seven training sets, estimated the predictive accuracy (measured by the Mean Squared Error) of each variant. The MSE was estimated using 10-fold Cross Validation: each training set was randomly divided into 10 approximately equally sized folds; for each fold, a model was built with the remaining nine and its prediction error calculated on that fold; this process was repeated for the 10 different folds and the prediction errors averaged. The whole 10-fold Cross Validation process was repeated 10 times with 10 different random permutations of each training set, and the final score of each variant was obtained by averaging over these 10 repetitions. For each of the seven algae the most promising variants of RT were collected together with their estimated prediction error. This model selection stage immediately revealed that the seven regression tasks posed quite different challenges: there was a large diversity in which techniques were estimated as most promising, depending on the alga under consideration.
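The estimation protocol just described (10-fold Cross Validation repeated over 10 random permutations) can be sketched generically as below; `fit` stands for any candidate variant's training procedure and is a placeholder, not RT's API.

```python
import random

def cv_mse(data, fit, n_folds=10, n_reps=10, seed=0):
    # Repeated k-fold cross validation estimate of the MSE of `fit`.
    # `fit(train)` must return a function mapping x to a prediction.
    rng = random.Random(seed)
    rep_scores = []
    for _ in range(n_reps):
        d = list(data)
        rng.shuffle(d)                      # a fresh random permutation
        folds = [d[i::n_folds] for i in range(n_folds)]
        errs = []
        for i in range(n_folds):
            test = folds[i]
            train = [p for j, f in enumerate(folds) if j != i for p in f]
            model = fit(train)
            errs.append(sum((model(x) - y) ** 2 for x, y in test) / len(test))
        rep_scores.append(sum(errs) / n_folds)
    return sum(rep_scores) / n_reps         # average over the repetitions

# Example "variant": a model that always predicts the training mean.
def mean_fit(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

data = [(float(i), float(i % 2)) for i in range(50)]  # y is 0/1 noise
print(cv_mse(data, mean_fit))  # roughly the variance of y (0.25)
```

Scoring every candidate variant with the same protocol makes their estimated MSEs directly comparable, which is what the per-alga ranking relies on.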
For each of the seven algae training sets we used the respective most promising RT variants to obtain models. These models were then applied to the given 140 test cases and their predictions collected. The final predictions that

¹⁰ In this example it is enough to use an infinite bandwidth (meaning that all training points contribute to the model) and a uniform weighting function (which gives equal weight to all training points).

were submitted as our solution were obtained by a weighted average of these predictions. To illustrate this weighting schema, imagine that for alga a the most promising RT variants and their estimated prediction errors were:

    M_a, Err_Ma ;  M_b, Err_Mb ;  M_c, Err_Mc ;  M_d, Err_Md

The final prediction for a test case x is a weighted average of the individual predictions, with each model weighted in inverse proportion to its estimated error:

    y = ( M_a(x)/Err_Ma + M_b(x)/Err_Mb + M_c(x)/Err_Mc + M_d(x)/Err_Md ) / ( 1/Err_Ma + 1/Err_Mb + 1/Err_Mc + 1/Err_Md )

where x is a test case, y is the prediction for case x, and M_a(x) is the prediction of model M_a for case x.

DISCUSSION

The solution obtained through the process described above was declared by the jury of the ERUDIT competition to be one of the three runners-up. After the announcement of the results, the organisation provided the true values of the test samples, which enabled us to analyse in more detail the performance of our proposed data analysis technique. Our solution achieved an overall MSE of 87.020. At the time of writing the scores of the other competitors are not known, so it is hard to interpret this number in isolation. Still, we have collected other statistics that provide further insight into the quality of the solution. We calculated the mean absolute deviation (MAD) of the predictions of RT¹¹. Figure 2 shows, for each of the seven algae (a to g), the minimum MAD, the maximum MAD, and an interval (represented by a box) of MAD ± Standard Error.

    Figure 2: The Scores of our Solution in terms of Absolute Error.

From this figure it is clear that the best results are obtained for predicting the frequency of algae c, d and g, while the worst predictions occur with algae a and f. Still, the overall Mean Absolute Error is relatively low¹².

¹¹ This statistic is more meaningful as it is on the same scale as the goal variable.
¹² The values of the algae distributions range from 0 to 100 (although the maximum value in the training samples was 89.8 for alga a).
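Under our reading of the weighting schema above (weights inversely proportional to the estimated error), the combination of the variants' predictions for one test case can be sketched as follows; the numbers are hypothetical, not taken from the competition data.

```python
def combine(preds_errs):
    # preds_errs: list of (prediction, estimated_error) pairs for one test
    # case. Weights are inversely proportional to the estimated error, so
    # variants with a lower estimated MSE contribute more to the result.
    num = sum(p / e for p, e in preds_errs)
    den = sum(1.0 / e for _, e in preds_errs)
    return num / den

# Four hypothetical variants and their estimated errors for one test case:
preds = [(40.0, 2.0), (44.0, 4.0), (39.0, 8.0), (50.0, 16.0)]
print(combine(preds))  # 41.6: pulled toward the low-error variants
```

Note how the result stays close to the predictions of the two variants with the smallest estimated errors, while the high-error variants only nudge it.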

With the true goal variable values we were able to confirm that the ranking of RT variants obtained through repeated 10-fold Cross Validation was correct most of the time. We also carried out some initial experiments to try to identify the main causes of the good performance of our solution. These experiments clearly show that the availability of different variants was highly beneficial, because worse results are obtained when applying the same variant to all seven regression tasks. As an example, while our multi-strategy weighted solution achieved an overall MSE of 87.020 on all seven tasks, using the single overall best RT variant (Torgo, 1999), which consists of local regression trees with partial linear models in the leaves, led to a higher MSE. Moreover, the weighting schema also brings some accuracy gains when compared to the alternative of always selecting the prediction of the best estimated variant for each task¹³. We think that using more sophisticated model combination methodologies like boosting (Freund, 1995) or bagging (Breiman, 1996) could provide even better accuracy.

CONCLUSIONS

We have described the application of a regression analysis tool named RT to the environmental data used in the 3rd International ERUDIT Competition. RT implements a regression methodology called local regression trees. This technique can be regarded as a combination of two existing methodologies: regression trees and local modelling. We have described the integration schema used in RT to obtain local regression trees, and shown how this schema is important in obtaining a tool that is able to approximate quite different regression surfaces. The results obtained by RT in the competition provide further confidence in the correctness of our integration schema. The methodology used to obtain a solution for the competition was based on the assumption that using different variants of RT on each of the seven regression tasks would provide better results than always using the same regression methodology. Our experiments and the results of the competition confirmed this hypothesis. Moreover, averaging over the best variants also improved the accuracy of the resulting predictions. This application provides further evidence of the advantages of systems incorporating multiple data analysis techniques, and of the accuracy gains from combining the predictions of different models.

REFERENCES

Atkeson, C.G., Moore, A.W., Schaal, S., 1997, Locally Weighted Learning, Artificial Intelligence Review, 11, special issue on lazy learning, Aha, D. (ed.).
Bellman, R., 1961, Adaptive Control Processes. Princeton University Press.
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J., 1984, Classification and Regression Trees, Wadsworth Int. Group, Belmont, California, USA.
Breiman, L., 1996, Bagging Predictors, Machine Learning, 24. Kluwer Academic Publishers.
Brodley, C., Utgoff, P., 1995, Multivariate decision trees, Machine Learning, 19. Kluwer Academic Publishers.
Cleveland, W., 1979, Robust locally weighted regression and smoothing scatterplots, Journal of the American Statistical Association, 74.
Cleveland, W., Loader, C., 1995, Smoothing by Local Regression: Principles and Methods (with discussion), in Computational Statistics.
Fan, J., 1995, Local Modelling, in Encyclopedia of Statistical Science.
Freund, Y., 1995, Boosting a weak learning algorithm by majority, Information and Computation, 121 (2).
Gama, J., 1997, Probabilistic linear tree, in Proceedings of the 14th International Conference on Machine Learning (ICML-97), Fisher, D. (ed.). Morgan Kaufmann.
Härdle, W., 1990, Applied Nonparametric Regression. Cambridge University Press.

¹³ While the weighted average obtained an MSE of 87.020, using only the predictions of the best estimated variant per task gives an MSE of 89.296.

Hastie, T., Tibshirani, R., 1990, Generalized Additive Models. Chapman & Hall.
Karalic, A., 1992, Employing Linear Regression in Regression Tree Leaves, in Proceedings of ECAI-92. Wiley & Sons.
Katkovnik, V., 1979, Linear and nonlinear methods of nonparametric regression analysis, Soviet Automatic Control, 5.
Moore, A., Schneider, J., Deng, K., 1997, Efficient Locally Weighted Polynomial Regression Predictions, in Proceedings of the 14th International Conference on Machine Learning (ICML-97), Fisher, D. (ed.). Morgan Kaufmann Publishers.
Murthy, S., Kasif, S., Salzberg, S., 1994, A system for induction of oblique decision trees, Journal of Artificial Intelligence Research, 2.
Nadaraya, E.A., 1964, On estimating regression, Theory of Probability and its Applications, 9.
Parzen, E., 1962, On estimation of a probability density function and mode, Annals of Mathematical Statistics, 33.
Quinlan, J., 1992, Learning with Continuous Classes, in Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. Singapore: World Scientific.
Rosenblatt, M., 1956, Remarks on some nonparametric estimates of a density function, Annals of Mathematical Statistics, 27.
Spiegelman, C., 1976, Two techniques for estimating treatment effects in the presence of hidden variables: adaptive regression and a solution to Reiersol's problem. Ph.D. Thesis, Dept. of Mathematics, Northwestern University.
Stone, C.J., 1977, Consistent nonparametric regression, The Annals of Statistics, 5.
Torgo, L., 1997a, Functional models for Regression Tree Leaves, in Proceedings of the International Conference on Machine Learning (ICML-97), Fisher, D. (ed.). Morgan Kaufmann Publishers.
Torgo, L., 1997b, Kernel Regression Trees, in Proceedings of the poster papers of the European Conference on Machine Learning (ECML-97), Internal Report of Faculty of Informatics and Statistics, University of Economics, Prague.
Torgo, L., 1998, A Comparative Study of Reliable Error Estimators for Pruning Regression Trees, in Proceedings of the Iberoamerican Conference on Artificial Intelligence, Coelho, H. (ed.), Springer-Verlag.
Torgo, L., 1999, Inductive Learning of Tree-based Regression Models, Ph.D. Thesis, Department of Computer Science, Faculty of Sciences, University of Porto.
Watson, G.S., 1964, Smooth Regression Analysis, Sankhya: The Indian Journal of Statistics, Series A, 26.


More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Report on On-line Graph Coloring

Report on On-line Graph Coloring 2003 Fall Semester Comp 670K Onlne Algorthm Report on LO Yuet Me (00086365) cndylo@ust.hk Abstract Onlne algorthm deals wth data that has no future nformaton. Lots of examples demonstrate that onlne algorthm

More information

Multiblock method for database generation in finite element programs

Multiblock method for database generation in finite element programs Proc. of the 9th WSEAS Int. Conf. on Mathematcal Methods and Computatonal Technques n Electrcal Engneerng, Arcachon, October 13-15, 2007 53 Multblock method for database generaton n fnte element programs

More information

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Additive Groves of Regression Trees

Additive Groves of Regression Trees Addtve Groves of Regresson Trees Dara Sorokna, Rch Caruana, and Mrek Redewald Department of Computer Scence, Cornell Unversty, Ithaca, NY, USA {dara,caruana,mrek}@cs.cornell.edu Abstract. We present a

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information