REG^2: A Regional Regression Framework for Geo-Referenced Datasets


Oner Ulvi Celepcikay
University of Houston, Department of Computer Science, Houston, TX
onerulvi@cs.uh.edu

Christoph F. Eick
University of Houston, Department of Computer Science, Houston, TX
ceick@cs.uh.edu

ABSTRACT

Traditional regression analysis derives global relationships between variables and neglects spatial variations in those variables. Hence, it lacks the ability to systematically discover regional relationships and to build better models that use this regional knowledge to obtain higher prediction accuracies. Since most relationships in spatial datasets are regional, there is a great need for regional regression methods that derive regional regression functions reflecting the different spatial characteristics of different regions. This paper proposes a novel regional regression framework that first discovers interesting regions exhibiting strong regional relationships between the dependent and the independent variables, and then builds a prediction model with a regional regression function associated with each region. Interesting regions are identified by running a representative-based clustering algorithm that maximizes an externally plugged-in fitness function. In this work, we propose two fitness functions: an R-squared-based fitness function and an AIC-based fitness function that handles overfitting better. We evaluate our framework in two case studies: (1) identifying causes of arsenic contamination in Texas water wells, and (2) determining spatially varying effects of house properties on house prices in the Boston Housing dataset. We demonstrate that our framework effectively identifies interesting regions and builds better prediction systems that rely on regional models.

Categories and Subject Descriptors: H.2.8 [Database Applications]: Spatial databases and GIS, Data Mining

General Terms: Algorithms, Design, Experimentation

Keywords: Spatial Data Mining, Regression Analysis, Regional Knowledge Discovery, Regional Regression, Clustering.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM GIS '09, November 4-6, 2009, Seattle, WA, USA. (c) 2009 ACM.

1. INTRODUCTION

Regression analysis has been extensively used in many scientific fields to discover relationships and dependencies among variables, and many variations of regression analysis have been proposed in the literature [2]. In spatial data mining, regression analysis can be used to discover spatial relationships and to build models for prediction. In geo-referenced datasets, most relationships exist at the regional level, not the global level. Since such spatial or regional relationships are often not explicitly represented in geo-referenced datasets, one major task in spatial data mining is to develop techniques and methodologies that automate the extraction of interesting and useful regional patterns, and to build better models that reflect the spatially varying characteristics of geo-referenced data. Moreover, capturing regional variation leads to a deeper understanding of important relationships that are embedded in a spatial dataset. For example, a global linear regression analysis of housing prices in a city derives coefficients that measure each attribute's contribution to the price of a house. However, coefficients tend to vary from one region to another; for example, the attribute have_pool might have a coefficient of 9,000 in a city-wide regression analysis, which indicates that having a pool adds $9,000 to the house value, when in reality, depending on the neighborhood, a pool adds between $5,000 and $50,000 to the house price, and its contribution differs regionally; e.g., it is much lower for houses in some underdeveloped parts of the city.
We claim that understanding these regional variations will not only lead to more accurate prediction models, but will also provide regional background knowledge concerning which attributes have a significant impact on house prices in which regions. Therefore, it is desirable to develop methods that capture this spatial variation and extract regional knowledge from spatial datasets. This type of regional knowledge is crucial for domain experts who seek a deeper understanding of regional variations in relationships among variables in geo-referenced datasets. Some local regression methods have been proposed in the literature, but most use pre-defined region boundaries such as zip codes, county limits, or grid structures based on spatial coordinate systems. However, in spatial data mining, regional patterns often do not coincide with predefined geographic boundaries, and important spatial relationships might not be discovered due to dilution. For example, proximity to a river might significantly add to the value of a house; but if census blocks are used as regions, this pattern will not be detected, especially if the pre-defined regions contain many houses that are not very close to the river. In general, such relationships can only be discovered by methods that seek out regions on their own that reflect the underlying

structure of the data, e.g. regions that contain houses with similar relationships between dependent and independent variables. This paper proposes a Regional Regression framework called REG^2 (pronounced REG-squared) that focuses on discovering regional regression functions that are associated with contiguous areas in the subspace of the spatial attributes, which we call regions. Figure 1 shows an example of discovered regions along with their regional regression functions and region representatives, where the regional regression function for region 10 is given as an example. First, interesting regions are discovered by running a representative-based clustering algorithm that maximizes an externally plugged-in fitness function; next, regional knowledge is extracted from the obtained subspaces (regions). We developed two fitness functions: an R-squared-based fitness function (RsqFitness) and an AIC-based (Akaike's Information Criterion) fitness function (AICFitness). The two fitness functions are used to guide the search for regions with strong regional linear relationships between the response variable and the independent variables.

2. RELATED WORK

2.1 Regression Analysis and AIC

Regression analysis has been extensively used in many scientific fields to discover linear relationships and dependencies among variables, and many variations of regression analysis exist in the literature [2]. OLS is the best known of all regression techniques; it provides a global model of the variable of interest that needs to be understood or predicted. R² is a measure of the extent to which the total variation of the dependent variable is explained by the model. In general, increasing the number of independent variables involved in the regression will lead to a higher R² value. R² alone is not a good measure of the goodness of fit of a model, since it only deals with the bias of the regression model, ignores the complexity of a model and its associated variance, and is therefore prone to overfitting.
Consequently, several information criteria have been developed that take the complexity of the employed model into consideration, with Akaike's Information Criterion (AIC) [1] being one of the most popular information criteria to measure the goodness of fit of an estimated statistical model. AIC provides a balance between bias and variance, and is estimated using the following formula:

AIC = −2 log L(θ̂) + 2k   (1)

where L(θ̂) is the maximized likelihood function, θ̂ is the maximum likelihood estimate of the parameter vector θ under the model, and k is the number of independent parameters of the model. Assuming that the errors are normally distributed, the AIC formula becomes:

AIC = 2k + n[ln(2π·SSE/n) + 1]   (2)

where SSE is the Residual Sum of Squares (called RSS in some of the literature) and n is the number of objects.

Figure 1. Discovered Regions and Regression Functions

The main contributions of the paper include:
1. A regional regression framework that employs representative-based clustering to discover interesting regions and their associated regional regression functions, without using any pre-defined region boundaries.
2. An AIC-based and an Rsq-based fitness function to guide the search for regions with highly accurate regression functions.
3. A correlation-based methodology to evaluate the performance of the employed fitness functions and to select fitness function parameters automatically.
4. An experimental evaluation of the framework in case studies that center on identifying causes of arsenic contamination in Texas water wells and on determining spatially varying effects of house properties on house prices in the Boston Housing dataset.

The remainder of the paper is organized as follows: In Section 2, we discuss related work. Section 3 provides a detailed discussion of our region discovery framework, the AIC-based fitness function, and the R-squared-based fitness function. Section 4 presents the experimental evaluation, and Section 5 concludes the paper.
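To make the Gaussian-error AIC of Eq. (2) and the small-sample AICu variant of McQuarrie and Tsai concrete, both criteria can be computed directly from a model's SSE, sample size n, and parameter count k. The sketch below is our own illustration (the function names are not from the paper):

```python
import math

def aic_gaussian(sse: float, n: int, k: int) -> float:
    """AIC for normally distributed errors: 2k + n[ln(2*pi*SSE/n) + 1]."""
    return 2 * k + n * (math.log(2 * math.pi * sse / n) + 1)

def aic_u(sse: float, n: int, k: int) -> float:
    """Small-sample variant AICu of McQuarrie and Tsai:
    ln(SSE/(n-k)) + (n+k)/(n-k-2)."""
    return math.log(sse / (n - k)) + (n + k) / (n - k - 2)

# A simpler model with a slightly worse fit can still win on AIC,
# because the 2k term penalizes extra parameters:
simple = aic_gaussian(sse=110.0, n=100, k=3)     # worse fit, fewer parameters
complex_ = aic_gaussian(sse=100.0, n=100, k=12)  # better fit, more parameters
assert simple < complex_  # the simpler model has the lower (preferred) AIC
```

Because lower AIC is better, a criterion like this trades a small increase in residual error for a meaningful reduction in model complexity.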
McQuarrie and Tsai [9] define a special variation, AICu, for small samples:

AICu = ln(SSE/(n−k)) + (n+k)/(n−k−2)   (3)

Increasing the number of free parameters to be estimated improves the goodness of fit, regardless of the number of free parameters in the data-generating process. Hence AIC not only rewards goodness of fit, but also includes a penalty that is an increasing function of the number of estimated parameters. This penalty discourages overfitting. The preferred model is the one with the lowest AIC value. In summary, the AIC methodology attempts to find the model that best explains the data with a minimum of free parameters.

2.2 Regression Trees

Regression trees are another local statistical prediction model; they recursively partition the data into small partitions and then fit a simple model to each of these partitions. The early Classification and Regression Tree (CART) algorithm [4] selects the split variable and split value that minimize the weighted sum of the variances of the target values in the two subsets. The selection of the first attribute to split on in regression trees dramatically affects the resulting regions, which causes a lack of flexibility. Since data is split greedily using a top-down approach, regions in regression trees are rectangular. Regression trees also aim to find local statistics, but our approach is more flexible, since it employs an externally plugged-in fitness function to be maximized rather than evaluating the variance of splitting on a single attribute as

regression trees employ, and it also performs a wider, non-greedy search. Moreover, the regions discovered by our approach can be convex polygons, which represent Voronoi cells, whereas regression trees are limited to discovering rectangular regions, since they discover regions by recursively splitting trees into 2 sub-trees in a top-down fashion. Our approach, on the other hand, searches for the optimal set of regions iteratively by modifying region representatives, maximizing an external, plug-in fitness function.

2.3 Geographically Weighted Regression

Geographically weighted regression (GWR) [5] is an instance-based, local spatial statistical technique used to analyze spatial non-stationarity. GWR generates maps used in exploring and interpreting spatial non-stationarity. Instead of calibrating a single regression equation, GWR generates a separate regression equation for a set of observation points that are usually determined using a grid structure. Each equation is calibrated using a different weighting of the observations contained in the dataset, based on the proximity of the observations to the observation points. Each GWR equation may be expressed as:

y_i = β_0(u_i, v_i) + Σ_k β_k(u_i, v_i)·x_ik + ε_i   (4)

where (u_i, v_i) denotes the coordinates of the i-th point and β_k(u_i, v_i) is a realization of the continuous function β_k(u, v) at point i [2]. Using OLS, the parameters for a linear regression model can be obtained by solving:

β̂ = (XᵀX)⁻¹ XᵀY   (5)

Similarly, the parameter estimates for GWR may be solved using a weighting scheme:

β̂(u_i, v_i) = (Xᵀ W(u_i, v_i) X)⁻¹ Xᵀ W(u_i, v_i) y   (6)

The weight assigned to each observation is based on a distance-decay function (w_ij) centered on observation i. Recently, GWR has been used in many research works investigating a variety of topical areas, including climatology [5], urban poverty [8], environmental justice [20], and the ecological inference problem [7]. GWR is similar to our approach, especially in aiming to capture the spatial variance of attributes over space, which in GWR is called spatial non-stationarity.
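The weighted solve in Eq. (6) is a small linear-algebra exercise. The sketch below is our own illustration, not GWR as specified in [5]; in particular, the Gaussian kernel and fixed bandwidth are illustrative assumptions, since real GWR implementations differ in kernel choice and bandwidth selection:

```python
import numpy as np

def gwr_beta(X, y, coords, point, bandwidth=1.0):
    """Local coefficient estimate at one observation point (Eq. 6):
    beta(u,v) = (X'WX)^{-1} X'Wy, with Gaussian distance-decay weights."""
    d2 = np.sum((coords - point) ** 2, axis=1)   # squared distances to the point
    w = np.exp(-d2 / (2 * bandwidth ** 2))       # distance-decay weights
    XtW = X.T * w                                # X'W without forming diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Sanity check: if the data follow a single global line, the locally
# weighted estimate recovers the same coefficients at any point.
x1 = np.array([0., 1., 2., 3., 4., 5.])
X = np.column_stack([np.ones(6), x1])            # intercept + one predictor
coords = np.array([[0, 0], [1, 0], [2, 0], [0, 1], [1, 1], [2, 1]], float)
beta = gwr_beta(X, 3.0 + 2.0 * x1, coords, np.array([1.0, 0.5]))
```

When the relationship truly varies over space, the estimates at different observation points diverge, which is exactly the non-stationarity GWR maps.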
Our approach, on the other hand, does not use any observation points or predefined grid structures; instead, it discovers polygon-shaped regions. In general, GWR assumes autocorrelation and will have problems with spatial datasets where patterns change sharply. For example, if we have two neighboring regions that are characterized by very different regression functions, GWR will have significant errors near the boundary of the two regions, because it uses multiple regression functions of nearby observation points and not a single region-specific regression function, as our approach does. In general, in our approach a regression function is assigned to each region where coefficient estimates are similar, reflecting the existence of a similar pattern of dependency between the dependent and independent variables in a particular region.

2.4 Cluster-Wise Regression (CLR)

The cluster-wise regression technique incorporates cluster analysis into OLS regression analysis. The simplest cluster-wise regression is the 2-cluster linear regression, which was introduced by Spath [25, 26, 27] and was further developed by other researchers. In summary, instead of using the classical homogeneity or separation criterion, cluster-wise regression is based on the accuracy of a linear regression model associated with each cluster, using the sum of squared prediction errors over the objects in each cluster. This technique has many applications, as it is well suited to market segmentation and product pricing [2, 6]. The mathematical formulation of CLR is:

Min_{z,b} Σ_{i=1}^{n} Σ_{k=1}^{K} z_ik·(y_i − Σ_{j=1}^{m} b_jk·x_ij − b_0k)²   (7)

subject to Σ_{k=1}^{K} z_ik = 1 for i = 1, …, n, and z_ik ∈ {0, 1} for i = 1, …, n and k = 1, …, K,

where x_ij and y_i are, respectively, the value of variable j and the value of the dependent variable for observation i; b_jk is the j-th regression coefficient for cluster k; and z_ik is a binary variable that equals 1 if and only if observation i belongs to cluster k. n is the number of observations, m the number of independent variables considered, and K the number of clusters.
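The objective in Eq. (7) is commonly attacked by alternating between assigning each observation to the cluster whose regression line fits it best and refitting each cluster's line. The single-predictor sketch below is our own minimal illustration of that idea, not Spath's algorithm; like all such alternating schemes, it only reduces the objective greedily and is sensitive to initialization:

```python
def fit_line(pts):
    """Simple OLS fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    b1 = sum((x - mx) * (y - my) for x, y in pts) / sxx if sxx else 0.0
    return my - b1 * mx, b1

def clusterwise_regression(pts, K=2, iters=25):
    """Alternate the cluster assignment (z_ik) and per-cluster OLS refits,
    greedily reducing the CLR objective of Eq. (7)."""
    pts = sorted(pts)
    groups = [pts[i::K] for i in range(K)]   # crude deterministic init
    for _ in range(iters):
        lines = [fit_line(g) for g in groups]
        new_groups = [[] for _ in range(K)]
        for x, y in pts:
            best = min(range(K),
                       key=lambda c: (y - lines[c][0] - lines[c][1] * x) ** 2)
            new_groups[best].append((x, y))
        if any(not g for g in new_groups) or new_groups == groups:
            break
        groups = new_groups
    return [fit_line(g) for g in groups]
```

On data generated from two parallel lines, e.g. y = x and y = x + 10, the procedure separates the two clusters and recovers both regression lines.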
The main objective of CLR is to minimize the sum of squared prediction errors for each object, using the equation of the cluster that the object belongs to. Our approach is similar, but our framework tries to minimize the AIC of each region rather than only the prediction error. Moreover, our approach searches for the optimal number of regions rather than assuming that the correct number of regions is known in advance, and it supports arbitrary, plug-in fitness functions, which provide flexibility and extensibility. More importantly, as demonstrated by Brusco et al. in [6], CLR has tremendous potential for overfitting, since minimizing the sum of the error sums of squares for the within-cluster regression models makes no effort to distinguish the error explained by the within-cluster regression models from the error explained by the clustering process.

3. METHODOLOGY

We now introduce the components of our regional regression framework, shown in Figure 2. The framework discovers interesting regions by running a representative-based clustering algorithm that maximizes an externally plugged-in fitness function. Region representatives and regional regression functions are associated with regions to support prediction.

Figure 2. Regional Regression (REG^2) Framework
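The clustering step in Figure 2 forms regions by assigning every object to its closest region representative in the spatial subspace. A greatly simplified sketch of that assignment step (function and variable names are our own illustration):

```python
def assign_to_representatives(objects, reps):
    """Form regions by assigning each object (x, y, value) to its closest
    representative in the spatial subspace, as in representative-based
    clustering; the resulting regions are Voronoi cells of the reps."""
    regions = [[] for _ in reps]
    for o in objects:
        d = [(o[0] - r[0]) ** 2 + (o[1] - r[1]) ** 2 for r in reps]
        regions[d.index(min(d))].append(o)
    return regions
```

The search over representative sets (adding, deleting, and moving representatives) is what the clustering algorithm optimizes; the assignment itself stays this simple.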

3.1 Region Discovery Framework

We employ the region discovery framework that was proposed in [3, 4]. The objective of region discovery is to find interesting places in spatial datasets: regions occupying contiguous areas in the spatial subspace. In this work, we extend this framework to extract regional regression functions. The framework incorporates domain knowledge into domain-specific plug-in fitness functions that are maximized by the clustering algorithm, and it employs a reward-based evaluation scheme to evaluate the quality of the discovered regions. Given a set of regions R = {r_1, …, r_k} with respect to a spatial dataset O = {o_1, …, o_n}, the fitness of R is defined as the sum of the rewards obtained from each region r_j (j = 1, …, k):

q(R) = Σ_{j=1}^{k} ψ(r_j) = Σ_{j=1}^{k} i(r_j)·size(r_j)^β   (8)

where i(r_j) is the interestingness of the region r_j, a quantity based on domain interest reflecting the degree to which the region is newsworthy. Fitness functions are the core components of the framework, as they capture a domain expert's notion of interestingness. The framework seeks a set of regions R such that the sum of rewards over all of its constituent regions is maximized. In general, the parameter β controls how much of a premium is put on region size. The size(r_j)^β component in q(R), with β > 1, increases the value of the fitness nonlinearly with respect to the number of objects in the region r_j. ψ(r_j) is the reward of the region; the region reward is proportional to its interestingness, but given two regions with the same interestingness, the larger region receives the higher reward, reflecting a preference for larger regions. Rewarding region size non-linearly encourages merging neighboring regions whose coefficient estimates are similar, reflecting the existence of a similar pattern of spatial variance of attributes within each region.

3.2 CLEVER Algorithm

We employ the CLEVER [3] clustering algorithm to find interesting regions in the experimental evaluation.
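The reward computation in Eq. (8) is simple to sketch; the snippet below (our own illustration, with the interestingness measure passed in as a plug-in function) also shows how β > 1 nudges the search toward merging two equally interesting regions into one:

```python
def fitness(regions, interestingness, beta):
    """q(R) = sum of i(r) * size(r)^beta over all regions (Eq. 8)."""
    return sum(interestingness(r) * len(r) ** beta for r in regions)

# With constant interestingness i(r) = 1 and beta > 1, one merged region
# of 8 objects outscores two separate regions of 4 objects each, so the
# clustering is rewarded for merging similar neighboring regions:
i_const = lambda r: 1.0
merged = fitness([list(range(8))], i_const, beta=1.05)
split = fitness([list(range(4)), list(range(4))], i_const, beta=1.05)
assert merged > split
```

With β = 1 the two configurations would tie; the premium on size is entirely controlled by β.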
CLEVER is a representative-based clustering algorithm that forms clusters by assigning objects to the closest cluster representative and seeks the optimal set of representatives with respect to q(R). The algorithm starts with a randomly created set of representatives and employs randomized hill climbing, sampling s neighbors of the current clustering solution as long as new clustering solutions improve the fitness value. To battle premature convergence, the algorithm employs re-sampling: if none of the s neighbors improves the fitness value, then t more solutions are sampled before the algorithm terminates. After CLEVER terminates, the region representatives and their associated regression functions are returned.

3.3 R-squared Fitness Function (RsqFitness)

The natural question in assessing an estimated model is: How well does it fit the data? In our framework, we need a fitness function to be optimized by the clustering algorithm that reflects the main characteristic of good regression models: high accuracy. In regression analysis, one of the most popular evaluation measures is R-squared (R²). Our approach uses the following R²-based interestingness measure:

Definition 1: (Rsq-based Interestingness i_Rsq(r))

i_Rsq(r) = 1 − SSE/SST   if n ≥ MinRegSize
i_Rsq(r) = 0             if n < MinRegSize   (9)

R² is mathematically equal to 1 − SSE/SST, where SSE is the Sum of Squared Residuals (Errors) and SST is the Total Sum of Squares; they are defined as:

SSE = Σ_{i=1}^{n} (y_i − ŷ_i)²  and  SST = Σ_{i=1}^{n} (y_i − ȳ)²

The RsqFitness function then becomes:

q_Rsq(R) = Σ_{j=1}^{k} i_Rsq(r_j)·size(r_j)^β   (10)

MinRegSize is a controlling parameter to battle the tendency toward very small regions with maximal variance in regression analysis, so that we do not end up with regions containing only a few objects that have very high R² values. The experiments conducted using the Rsq-based fitness function suggest that using R² alone frequently leads to lower prediction accuracies on unseen examples due to overfitting. We need a better model selection criterion to balance the tradeoff between bias and variance.
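Definition 1 can be sketched directly from the SSE and SST formulas above; the function below is our own illustration, with an arbitrary illustrative default for the MinRegSize threshold:

```python
def rsq_interestingness(y_true, y_pred, min_reg_size=10):
    """i_Rsq(r) = 1 - SSE/SST if the region has at least MinRegSize
    objects, and 0 otherwise (Definition 1). min_reg_size = 10 is an
    illustrative choice, not a value prescribed by the framework."""
    n = len(y_true)
    if n < min_reg_size:
        return 0.0                       # tiny regions earn no reward
    y_bar = sum(y_true) / n
    sse = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    sst = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - sse / sst
```

A perfect regional fit yields interestingness 1, a fit no better than predicting the regional mean yields 0, and undersized regions are zeroed out regardless of fit.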
Therefore, we developed another fitness function that is based on an information criterion, namely AIC.

3.4 AIC-based Fitness Function (AICFitness)

Akaike's Information Criterion (AIC) is one of the most commonly used measures of the goodness of an estimated model. We prefer AIC because it takes model complexity into consideration; moreover, we believe AIC takes the number of observations into consideration more effectively. There are many variations of AIC, including AICu, which is used for small datasets and which we believe is a good fit for small regions. Since a lower AIC value indicates a better model and our framework tries to maximize the fitness function, we use 1/AIC as the interestingness measure. We use AICu, as defined in equation (3), if the number of objects is less than a predefined threshold th. Our work employs the interestingness measure in Definition 2 to assess the strength of the relationships between the dependent variable and the independent variables in a region r (also see Section 2.1 for further explanation):

Definition 2: (AIC-based Interestingness i_AIC(r))

i_AIC(r) = 1 / (2k + n[ln(2π·SSE/n) + 1])          if n ≥ th
i_AIC(r) = 1 / (ln(SSE/(n−k)) + (n+k)/(n−k−2))     if n < th   (11)

The AICFitness function then becomes:

q_AIC(R) = Σ_{j=1}^{k} i_AIC(r_j)·size(r_j)^β   (12)

Our fitness function repeatedly applies regression analysis during the search for the optimal set of regions, minimizing the AIC value in each region. Having an externally plugged-in AIC-based fitness function enables the clustering algorithm to probe for the

optimal partitioning and encourages the merging of two regions that exhibit structural similarities.

3.5 The Prediction Schema

The value of the dependent variable for an object is determined as follows: first, the model finds the closest region representative using a 1-nearest-neighbor query; then, using the regression function associated with that region, it predicts the value of the dependent variable. The pseudo-code of this schema is illustrated in Figure 3. β_0r is the regression intercept for region r, and β_jr is the regression slope coefficient for independent variable j in region r. SSE_TE is the Residual Sum of Squares on the test set and SSE_TR is the Residual Sum of Squares on the training set.

for each object in new_set {
  - find the closest region representative;
  - retrieve the regression function coefficients associated with the region (β_0r and β_jr for all j);
  - estimate the predicted value of the dependent variable (ŷ) using these coefficients;
  - using the observed value of the dependent variable (y) and the predicted value (ŷ), estimate SSE_TE(REG^2);
  - do the same using the global model: SSE_TE(GL)
}
- output the regional and global RSS (SSE) for new_set

Figure 3. The pseudo-code for the prediction schema

4. EXPERIMENTAL EVALUATION

4.1 Objectives and Design of the Experiments

One important objective of the experimental evaluation is to assess the benefits of employing representative-based clustering to discover regional regression functions. Another important objective is to compare the performance of the fitness functions, AICFitness and RsqFitness, with respect to prediction accuracy. In Section 4.2, a correlation-based methodology for fitness function parameter selection is introduced. We also assess how the regions found using our approach compare with regions that are selected randomly. In order to evaluate the true performance of our methodologies and to make fair comparisons, our experimental benchmark is composed of four steps. First, we apply OLS regression to the global data.
Then we use our framework to discover regions and regional regression functions, using both the Rsq-based fitness function and the AIC-based fitness function. We compare the R² values of the newly discovered regions against the global R² and, more importantly, the Residual Sum of Squares (SSE) of the new results against the global SSE, to determine the total accuracy improvement. One could argue that any method that divides data into subregions increases the R² value and reduces the SSE value, since smaller numbers of objects are involved. This is true to some extent; therefore, to show that the regions discovered by our framework are significantly better than randomly selected regions, we also run experiments in which regions are "discovered" randomly, without using any fitness function. This step is repeated five times, and the average of the SSE is taken, to be fair to the random region discovery. So each dataset was evaluated using the following four benchmarks for each parameter setting: 1) Global Regression, 2) Random Regions, 3) Rsq Fitness Function, 4) AIC-based Fitness Function. We used 5-fold cross-validation to determine prediction accuracy in the experiments. The SSE values presented in tables and figures represent the values from all 5 folds.

4.2 Parameter Selection Methodology

Utilizing AIC addresses overfitting within a region, but we still must deal with global overfitting, which can be attributed to having too many regions in the regional regression model. The fitness function parameter β is used to control the number of regions to be discovered and thus the overall model complexity. The challenge is to find a value of β that strikes the right balance between underfitting and overfitting for a given dataset. By choosing larger β values, model complexity can be penalized, if this is desirable for the particular application. This raises the question of how we can determine good values for β and other fitness function parameters, which is the subject of the remainder of this section.
In order to find the best parameter combinations, we developed a parameter selection methodology that is based on giving preference to fitness functions that demonstrate a high negative correlation between region reward and region prediction error on unseen data. N-fold cross-validation is used to assess the prediction error on training and test data. Below, we describe in detail how this correlation (and others involving training accuracy) is computed. Let:
- ψ be the region reward function
- CRV be the set of (training-set, test-set) pairs to be used in n-fold cross-validation
- O(r) be the set of objects belonging to region r
- SSE_TE(r) be the test-set sum of squared errors of r
- SSE_TR(r) be the training-set sum of squared errors of r
- |r| be the number of training-set objects belonging to r

FOR EACH (training-set, test-set) pair in CRV DO
  1. Compute regions R using the training set;
  2. FOR EACH region r ∈ R DO
     a. determine the number of objects of the test set that belong to r;
     b. determine the SSE for the training-set and test-set objects in r;
     c. store (ψ(r), SSE_TE(r)·|r|) and (TS-OBJECTS(r), SSE_TR(r));
  3. Compute the correlations between the stored entries in the table.

Basically, we determine for each test set of each fold which objects belong to which region; then we determine the prediction error for those objects using SSE and compute the correlation² between regional rewards and regional prediction errors, which is expected to be negative, since more highly rewarded regions are expected to have lower errors. Table 1 and Table 2 report the average correlations between regional rewards and regional prediction errors for different values of β. Some normalization has to be performed to cope with size discrepancies between the training-set objects belonging to a region (which determine the rewards) and the test-set objects belonging to a region (which determine the error on unseen examples).

² Additionally, we compute the correlations between regional training accuracy, regional testing accuracy, and region rewards.
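The correlation used by this selection methodology is the standard Pearson coefficient, computed here between per-region rewards and per-region prediction errors; a minimal sketch (the function name and sample values are our own illustration):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient. In the parameter selection
    methodology, xs are per-region rewards and ys per-region prediction
    errors; a strongly negative value is the desired outcome."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Highly rewarded regions should show low error, i.e. correlation near -1:
rewards = [9.0, 7.5, 6.0, 4.0, 2.0]
errors = [1.1, 2.3, 3.0, 4.2, 5.9]
assert pearson(rewards, errors) < -0.9
```

A β setting whose reward function correlates weakly (or positively) with test error is a sign that the fitness function is rewarding regions that do not generalize.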

Table 1. Reward & Prediction Error Correlations in Arsenic (Arsenic dataset; Corr(Reward, SSE_TR) and Corr(Reward, SSE_TE) of the AICFitness function, per β value)

Table 2. Reward & Prediction Error Correlations in Housing (Boston Housing data; Corr(Reward, SSE_TR) and Corr(Reward, SSE_TE) of the AICFitness function, per β value)

The correlation estimates suggest that there is a negative correlation between the regional reward assigned to a region and its prediction error, which is an indication of the capability of our framework to identify highly correlated regions that capture and minimize the spatial variation among the attributes. Table 3 lists the experiments that will be discussed in the remainder of the paper, whose parameters were selected using the previously described parameter selection methodology. The datasets used in the experiments are both real datasets and are explained next.

Table 3. Final set of experiments and parameters used

Exp#   Beta (β)   Fitness Function   Dataset
1      1.0        RsqFitness         Arsenic
2      1.03       RsqFitness         Arsenic
3      1.05       RsqFitness         Arsenic
4      1.1        RsqFitness         Arsenic
5      1.0        AICFitness         Arsenic
6      1.03       AICFitness         Arsenic
7      1.05       AICFitness         Arsenic
8      1.1        AICFitness         Arsenic
9      1.03       Both               Boston Housing
10     1.05       Both               Boston Housing
11     1.1        Both               Boston Housing
12     1.7        Both               Boston Housing

4.3 A Real-World Case Study: Texas Water Wells Arsenic Project

Arsenic is a deadly poison, and long-term exposure to even very low arsenic concentrations can cause cancer [28]. Therefore, it is extremely crucial to understand the factors that cause high arsenic concentrations to occur. In particular, we are interested in identifying other attributes that contribute significantly to the variance of the arsenic concentration. The datasets used in the experiments were created from the Texas Water Department Ground Water Database [28], which samples Texas water wells regularly. The datasets were generated by cleaning out duplicate, missing, and inconsistent variables, and by aggregating the arsenic amount when multiple samples exist. Our dataset has 3 spatial and 10 non-spatial attributes.
Longitude, Latitude, and Aquifer ID are the spatial attributes, and Arsenic (As), Molybdenum (Mo), Vanadium (V), Boron (B), Fluoride (F), Silica (SiO2), Chloride (Cl), and Sulfate (SO4) are 8 of the non-spatial attributes, all of which are chemical concentrations. The other 2 non-spatial attributes are Total Dissolved Solids (TDS) and Well Depth (WD). The dataset has 1,653 objects.

4.4 Boston Housing Data, Corrected

We also evaluated our framework using the corrected version of the Boston Housing Data, which contains 506 census tracts of Boston from the 1970 census. We use the corrected version since it includes additional spatial information, namely Longitude and Latitude. The dataset was taken from the StatLib library [27] maintained at Carnegie Mellon University. We used MEDV, the median value of homes in USD 1000's, as the target (dependent) variable, and the following as independent variables: CRIM (crime rate), NOX (nitric oxides concentration), RM (average number of rooms), AGE (proportion of owner-occupied units built prior to 1940), RAD (index of accessibility to radial highways), TAX (property tax rate), PTRATIO (pupil-teacher ratio), B (Black population proportion), and LSTAT (percentage of lower-status population). We omitted some less important attributes from this regression.

4.5 Arsenic Dataset Results

4.5.1 SSE Improvements

The total Residual Sum of Squares (SSE) for the global model, the Rsq fitness model, the AIC fitness model, and the model with randomly discovered regions is shown in Figure 4. These SSE values are estimated from the training data. As shown in the figure, the models with the Rsq fitness function and the AIC-based fitness function both reduce the SSE significantly. For example, where the global SSE is 76,134, the RsqFitness model reduces it to 31,578 and the AICFitness model reduces it to 26,974. The SSE produced by the model using randomly discovered regions is 52,176, which is still an improvement over the global model, as expected, since fewer objects are involved; but both of our models outperform the random model as well.
In order to illustrate the SSE improvement better, the improvements in percentages are shown in Table 4.

Figure 4. SSE values of the four models (Residual Sum of Squares versus β for the Global, Random, R-sq Fitness, and AIC-Fitness models)

Table 4. SSE Improvements over the Global Model

Beta (β)   Random   R-sq Fitness   AIC-Fitness
1.0        31%      59%            65%
1.03       22%      64%            68%
1.05       26%      57%            69%
1.1        32%      37%            69%

The results show that using randomly selected regions improves the accuracy somewhat, which is expected due to the involvement of fewer objects. But both models generated by our framework reduce the prediction error significantly: RsqFitness reduces the error by percentages in the high 50s, and the AICFitness model reduces it by almost 70%, producing the best results. The comparison of these three models is sketched in Figure 5.

Figure 5. SSE improvements over Global Regression (improvement in SSE versus β for the Random, R-sq Fitness, and AIC-Fitness models)

4.5.2 R² and AIC Value Improvements

The discovered regions illustrate great improvements in R² values. The R² value was 0.682 for the global data, which means that 68.2% of the arsenic variation can be explained by the other 7 chemical variables for the Texas-wide data. The R² values for the top 10-ranked regions in Experiment 2 are given in Table 5. As can be seen from the table, our framework discovers regions that show high correlation; in these 10 regions, on average 95% of the arsenic variation can be explained using the other chemical variables. For example, the R² value increased from 68% to 98.1% in Region 34, with 95 water wells, which indicates that in this region there exist stronger correlations between arsenic and the chemicals.

Table 5. R² value improvements using REG² (Region, R² Value, Size; top-10 regions of Experiment 2. Texas-wide: 68.2% with 1,655 wells; Region 34: 98.1% with 95 wells; another region: 96.34% with 20 wells; the regional R² values average about 95%.)

Table 6. R² value improvements using Random Regions (Region, R² Value, Size; top-10 random regions of Experiment 2. Texas-wide: 68.2% with 1,655 wells; e.g., one region: 73.70% with 52 wells.)

The R² improvements with the random region model were also estimated for each experiment.
There is some improvement, as expected, since the regions are much smaller than the global dataset, but the improvements are much smaller than those provided by the RsqFitness model. The R^2 values for the top 10-ranked regions using the random-region model in experiment 2 are given in Table 6 as an example; they are not provided for the other experiments due to limited space. The high R^2 values of the regions in Table 5 indicate that our framework successfully discovers regions, along with their regional regression functions, which represent a better model than linear regression applied to the global data. The average R^2 value of the regions in Table 6 is about 0.80, which means on average only around 80% of the arsenic variation can be explained by the other chemicals. Moreover, the high R^2 values of Region 37 and Region 19 dominate this average; otherwise, the improvement observed in all other regions can be considered insignificant. We now provide the AIC value improvements in Table 7.

Table 7. AIC value improvements in Arsenic

  Region      AIC Value   Size
  Texas          —          —
  Region —       —          —
  Region —       —          —
  Region —       —          —
  Region —       —          —
  Region —       —          —
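A minimal sketch of an AIC-based fitness of the kind plugged into the clustering search: it assumes the standard Gaussian-likelihood AIC for least squares, n*ln(RSS/n) + 2k (the exact formula used by the framework is not restated in this section), and takes 1/AIC as the value to maximize, which presumes positive AIC values as in this example. The RSS numbers are illustrative; only the sizes (655 Texas wells, a 95-well region) echo the data above.

```python
import math

def aic_gaussian(rss, n, k):
    """AIC for a least-squares fit with Gaussian errors, constants dropped:
    n * ln(RSS / n) + 2k.  rss: residual sum of squares, n: number of
    observations, k: number of fitted parameters."""
    return n * math.log(rss / n) + 2 * k

def aic_fitness(rss, n, k):
    """Fitness to maximize: reciprocal of AIC (lower AIC -> higher fitness)."""
    return 1.0 / aic_gaussian(rss, n, k)

# Illustrative numbers only: a global fit over the 655 Texas wells vs. a
# much tighter fit on a 95-well region, both with 7 predictors + intercept.
aic_global = aic_gaussian(rss=70_000.0, n=655, k=8)
aic_region = aic_gaussian(rss=900.0, n=95, k=8)
print(f"AIC global = {aic_global:.1f}, AIC region = {aic_region:.1f}")
print(f"fitness global = {aic_fitness(70_000.0, 655, 8):.5f}, "
      f"fitness region = {aic_fitness(900.0, 95, 8):.5f}")
```

The 2k term is what lets this fitness penalize overfitting: shrinking RSS by adding parameters only pays off if the likelihood gain outweighs the parameter penalty.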

Again, since a lower AIC value indicates a better model, 1/AIC was the fitness value to be maximized in the fitness function. The AIC values in the discovered regions show great improvement compared to the global AIC. Since the AICFitness model provides the best accuracy in terms of the Residual Sum of Squares, as shown in Table 4, these improvements also indicate that the regions capture the spatial variation and minimize the variation within each region (cluster).

4.6 Boston Housing Dataset Results

4.6.1 SSE Improvements

The SSE values of the Boston Housing data for the 4 models described previously are shown in Figure 6. As shown in the figure, both the model with the R-sq fitness function and the model with the AIC-based fitness function reduce SSE significantly. For example, in an experiment with a beta value of 1.0, the global SSE is 10,444; the model with R-sq fitness reduces it to 2,056, and the model with AIC fitness reduces it to 1,961. The SSE produced by the model using randomly discovered regions is 4,651 for these experiments, and more than 7,000 in 2 of the experiments. The SSE improvements in percentages are shown in Table 8.

Figure 6. SSE values of the 4 models for the Boston Housing data (Residual Sum of Squares vs. beta for the Global, Random, R-sq Fitness, and AIC-Fitness models)

Table 8. SSE_TR improvements in Boston Housing Data

  Beta    Random Regions   R-sq Fitness   AIC-Fitness
  1.0          60%              81%            82%
  1.03         58%              77%            80%
  1.1          31%              70%            81%
  1.7          27%              55%            83%

The SSE_TR improvements in the Boston Housing dataset experiments are much better than in the Arsenic dataset, since this dataset has more correlation and more spatial variance, so the search for regions that minimize AIC or maximize the R-sq value yields better regions and, as a result, better prediction accuracy. As far as SSE_TE is concerned, the Boston Housing data also performs better than the Arsenic dataset. Due to space limitations not all results are provided, but Table 9 exemplifies the SSE_TE(REG^2) and SSE_TE(GL) calculations described in the prediction schema in Section 3.5.
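The SSE_TE(REG^2) versus SSE_TE(GL) comparison can be sketched as follows. All models and data here are toy values, and routing a test point to the region of its nearest representative is an assumption; the actual prediction schema is the one defined in Section 3.5.

```python
import math

# Hypothetical per-region linear models discovered on training data:
# representative location -> (slope, intercept).
regions = {
    (0.0, 0.0): (2.0, 1.0),
    (10.0, 10.0): (-1.5, 40.0),
}
global_model = (0.3, 15.0)  # one pooled fit over all training data (toy values)

def predict_regional(loc, x):
    """Route a test point to the region of its nearest representative,
    then apply that region's regression function."""
    rep = min(regions, key=lambda r: math.dist(r, loc))
    slope, intercept = regions[rep]
    return slope * x + intercept

def predict_global(x):
    slope, intercept = global_model
    return slope * x + intercept

# Test set: (location, predictor value, observed response) triples,
# generated here from the regional models themselves.
test = [
    ((1.0, 0.5), 3.0, 2.0 * 3.0 + 1.0),
    ((9.0, 9.5), 4.0, -1.5 * 4.0 + 40.0),
]

sse_te_reg = sum((predict_regional(loc, x) - y) ** 2 for loc, x, y in test)
sse_te_gl = sum((predict_global(x) - y) ** 2 for loc, x, y in test)
improvement = 1.0 - sse_te_reg / sse_te_gl
print(f"SSE_TE(REG^2) = {sse_te_reg:.1f}, SSE_TE(GL) = {sse_te_gl:.1f}, "
      f"improvement = {improvement:.0%}")
```

Because each toy test point was generated by its region's own model, the regional predictor achieves zero test error here while the single global line does not; the real datasets show smaller but still substantial reductions.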
As shown in the table, our framework discovers regions and their regional regression coefficients that provide better prediction than the global model. Better prediction is achieved in many regions, and the percentage of such regions is also given in Table 9. Some regions with very high prediction error reduce the overall prediction-accuracy improvement, but there is still a significant 27% reduction in testing error, which leaves room for improvement in future work.

Table 9. SSE_TE improvements in Boston Housing Data

  Beta   SSE_TE(GL)   SSE_TE(REG^2)   SSE Improvement   % of regions predicting better
  1.1      17,182        12,566             27%                  72%
  1.7      20,028        14,799             26%                  65%

4.7 Discussion of Regional Characteristics

The global and regional regression results show that the relationship of the arsenic concentration with the other chemical concentrations varies spatially and is not constant over space, which provides a motivation for regional knowledge discovery. In other words, there are significant differences in arsenic concentrations in water wells across various regions of Texas. Some of these differences are found to be due to the varying impact of the independent variables on the arsenic concentration. To exemplify this, we compare the global regression model with the regression function of a particular region. The results of the global regression and of the regional regression for Region 10 are shown in Tables 10 and 11, respectively, followed by a discussion of these results.

Table 10. Regression Result for Global Data

  As       Coef.   Std.Err.   t     P>|t|
  Mo         —        —       —       —
  V          —        —       —       —
  B          —        —       —       —
  Fl         —        —       —       —
  SiO2       —        —       —       —
  Cl         —        —       —       —
  SO4        —        —       —       —
  const      —        —       —       —

  R-squared Value: 68%    Adjusted R-squared Value: 68%

The global OLS regression result suggests that Molybdenum, Vanadium, Boron, and Silica increase the arsenic concentration, while Sulfate and Fluoride decrease it Texas-wide. Moreover, Chloride (Cl) and Sulfate (SO4) are not significant as global predictors of the arsenic concentration, but they are significant in some regions, for example Region 10. Conversely, Boron, Fluoride, and Silica are globally significant and highly correlated with arsenic, but this is not the case in Region 10.
This information is crucial to domain experts who seek to determine the controlling factors of arsenic pollution, as it can help reveal hidden regional patterns and special characteristics of a region. For example, in this region a high arsenic level is highly correlated with high Sulfate and Chloride levels, which is an indication of external factors that play a role in this region, such as a nearby chemical plant or toxic waste. Our framework is able to successfully detect such hidden regional associations between attributes. The global and regional regression results show that the relationship of the arsenic concentration with the other chemical concentrations varies spatially and is not constant over space. In addition, there are unexplained differences that are not accounted for by our independent variables, which might be due to external factors such as toxic waste or the proximity of a chemical plant.

Table 11. Regression Result for Region 10

  As       Coef.   Std.Err.   t     P>|t|
  Mo         —        —       —       —
  V          —        —       —       —
  B          —        —       —       —
  Fl         —        —       —       —
  SiO2       —        —       —       —
  Cl         —        —       —       —
  SO4        —        —       —       —
  const      —        —       —       —

  R-squared Value: 95.03%    Adjusted R-squared Value: 93.73%

4.8 Implementation Platform and Efficiency

The components of the framework described in this paper were developed using Cougar^2 [9], an open-source, Java-based data mining and machine learning framework developed by our research group [11]. All experiments were performed on a machine with a 1.79 GHz processor and 2 GB of memory. The parameter β is the most important factor in determining the run time. The run times of the experiments with respect to the β values used are shown in Figure 7.

Figure 7. Run times vs. β values (run time in milliseconds for Arsenic RsqFitness, Arsenic AICFitness, Boston Housing RsqFitness, and Boston Housing AICFitness)

For the arsenic dataset, the β value of 1.05 takes longer than 1.1 or 1.5, but it provides better results in terms of SSE improvement and a better correlation between testing- and training-set error, so the extra time is spent exploring better solutions.
The Boston Housing dataset experiments take less time mainly for two reasons: 1) the dataset has fewer objects; 2) more importantly, as the results in Section 4 suggest, this dataset has more correlation and more spatial variance, so the search for regions that minimize AIC or maximize the R-sq value takes less time than for the Arsenic data. We observed that more than 80% of the computational resources are spent on determining regional fitness values when discovering regions. Even though our framework repeatedly applies regression analysis to each explored region combination until no further improvement occurs, it is still efficient compared to approaches in which regression is applied and models are compared that many times using other statistical tools.

5. CONCLUSION

This paper proposes a novel regression framework for spatial datasets that focuses on discovering regional regression functions associated with contiguous areas in the subspace of the spatial attributes, which we call regions. Unlike other research that employs local or global regression, our approach emphasizes regional patterns and provides a comprehensive methodology to help discover them. The proposed framework discovers regions by employing representative-based clustering algorithms that maximize AIC-based or R-sq-based fitness functions; next, regional knowledge (regression functions, in our case) is extracted from the obtained regions. In general, different discovered regions capture different relationships between the dependent and independent variables. This regional knowledge is crucial for domain experts in understanding the underlying structure of the data. We also developed a generic correlation-based methodology to evaluate and select fitness functions, and the parameters of fitness functions, for a given dataset. This capability is critical for dealing with overfitting in regional regression and for more accurate prediction.
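As a rough sketch of the search summarized above — representative-based clustering driven by a plug-in fitness function — the following toy implementation greedily swaps cluster representatives as long as the fitness improves. It is a simplified stand-in, not the framework's actual algorithm: the real system uses the R-sq- and AIC-based fitness functions and additional parameters such as β.

```python
import math

def discover_regions(points, fitness, k):
    """Greedy representative-based clustering with a plug-in fitness function.

    points:  list of (x, y) locations
    fitness: maps a clustering (list of index lists) to a score to maximize
    """
    reps = list(range(k))  # start from the first k points as representatives

    def assign(reps):
        """Assign every point to its nearest representative."""
        clusters = [[] for _ in reps]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda j: math.dist(p, points[reps[j]]))
            clusters[nearest].append(i)
        return clusters

    best = fitness(assign(reps))
    improved = True
    while improved:                      # hill-climb over representative swaps
        improved = False
        for slot in range(k):
            for cand in range(len(points)):
                if cand in reps:
                    continue
                trial = reps.copy()
                trial[slot] = cand
                score = fitness(assign(trial))
                if score > best:         # keep only strictly improving swaps
                    reps, best, improved = trial, score, True
    return assign(reps), best

# Toy data: two well-separated spatial blobs of 10 points each.
pts = ([(j * 0.1, 0.0) for j in range(10)]
       + [(10.0 + j * 0.1, 10.0) for j in range(10)])

def compactness(clusters):
    """Toy stand-in for the R-sq/AIC fitness: reward spatially tight clusters."""
    score = 0.0
    for idxs in clusters:
        if not idxs:
            continue
        cx = sum(pts[i][0] for i in idxs) / len(idxs)
        cy = sum(pts[i][1] for i in idxs) / len(idxs)
        score -= sum(math.dist(pts[i], (cx, cy)) for i in idxs)
    return score

clusters, score = discover_regions(pts, compactness, k=2)
print([len(c) for c in clusters])  # -> [10, 10]: the two blobs are recovered
```

The key design point mirrored here is that the search never inspects the fitness function's internals: any region-level score (compactness, regional R-sq, 1/AIC) can be plugged in unchanged.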
Compared to competing approaches, our approach does not assume that regions are given a priori, and it provides sophisticated clustering techniques to find good regions; moreover, we provide a methodology for dealing with overfitting and for selecting fitness functions that are suitable for the given data. We claim that none of the competing approaches addresses both issues in its proposed framework. Our framework was tested and evaluated in a real-world case study that analyzes regional correlation patterns among arsenic and other chemical concentrations in Texas water wells. The framework was also tested on the corrected version of the Boston Housing dataset. We demonstrated that our framework can effectively and efficiently identify highly correlated regions, along with regional regression functions that capture the spatial variation of attributes better than global models, and that it builds better models for prediction. We plan to develop other versions of the AICFitness function in which the fitness function is scaled to penalize bad regions more harshly. We also plan to investigate other information criteria, such as ICOMP [3] and BIC [23], and regional regression approaches that use validation sets to get a better handle on overfitting.

6. REFERENCES

[1] Akaike, H. Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory. Budapest: Akademiai Kiado, 267-281, 1973.
[2] Fox, J. (1997). Applied Regression Analysis. Wiley Series in Probability and Statistics.
[3] Bozdogan, H. ICOMP: a new model-selection criterion. In Classification and Related Methods of Data Analysis, H. H. Bock (Ed.), Elsevier Science Publishers, Amsterdam, 1988.
[4] Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[5] Brunsdon, C., McClatchey, J. and Unwin, D. (2001). Spatial variations in the average rainfall-altitude relationship in Great Britain: an approach using geographically weighted regression. International Journal of Climatology, 21.
[6] Brusco, M. J., Cradit, J. D., Steinley, D., & Fox, G. L. (2008). Cautionary remarks on the use of clusterwise regression. Multivariate Behavioral Research, 43(1).
[7] Calvo, C. and Escolar, M. (2003). The local voter: a geographically weighted approach to ecological inference. American Journal of Political Science, 47.
[8] Choo, J., Jiamthapthaksin, R., Chen, C.-S., Celepcikay, O.U., Giusti, C., Eick, C.F.: MOSAIC: A proximity graph approach for agglomerative clustering. In Proc. 9th Int'l Conference on Data Warehousing & Knowledge Discovery (DaWaK 2007), Germany, 2007.
[9] Cougar^2 Framework.
[10] Cressie, N.: Statistics for Spatial Data (Revised Edition). New York: Wiley, 1993.
[11] Data Mining and Machine Learning Group, University of Houston.
[12] Ding, W., Jiamthapthaksin, R., Parmar, R., Jiang, D., Stepinski, T., and Eick, C.F. Towards Region Discovery in Spatial Datasets. In Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Osaka, Japan, May 2008.
[13] Eick, C.F., Parmar, R., Ding, W., Stepinski, T., Nicot, J.P.: Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets. In Proc. 16th ACM SIGSPATIAL International Conference on Advances in GIS (ACM-GIS), Irvine, California, November 2008.
[14] Eick, C.F., Vaezian, B., Jiang, D., Wang, J.: Discovery of interesting regions in spatial data sets using supervised clustering. In Proc. 10th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Berlin, Germany, 2006.
[15] Fotheringham, A.S., Brunsdon, C. & Charlton, M.: Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. John Wiley, 2002.
[16] Aurifeille, J.M. A bio-mimetic clusterwise regression algorithm for consumer segmentation. In Advances in Computational Management, pages 45-63, 1998.
[17] Johnson, R.A.: Applied Multivariate Analysis. Englewood Cliffs, N.J.: Prentice Hall, 1992.
[18] Longley, P. A. and Tobon, C. (2004). Spatial dependence and heterogeneity in patterns of hardship: an intra-urban analysis. Annals of the Association of American Geographers, 94.
[19] McQuarrie, A. D., and Tsai, C. (1998). Regression and Time Series Model Selection. World Scientific Publishing Co. Pte. Ltd., River Edge, NJ.
[20] Mennis, J. and Jordan, L. (2005). The distribution of environmental equity: exploring spatial nonstationarity in multivariate models of air toxic releases. Annals of the Association of American Geographers, 95.
[21] Brusco, M.J., Cradit, J.D., and Stahl, S. A simulated annealing heuristic for a bicriterion partitioning problem in market segmentation. Journal of Marketing Research, 39:99-109, 2002.
[22] Openshaw, S. "Geographical data mining: key design issues". GeoComputation 99: Proceedings Fourth International Conference on GeoComputation, Mary Washington College, Fredericksburg, Virginia, USA, July 1999.
[23] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
[24] Spath, H. Clusterwise linear regression. Computing, 22(4), 367-373, 1979.
[25] Spath, H. Correction to Algorithm 39: Clusterwise Linear Regression. Computing, 26, 275, 1981.
[26] Spath, H. Algorithm 48: A Fast Algorithm for Clusterwise Linear Regression. Computing, 29, 175-181, 1982.
[27] STATLIB Probability and Statistics Library.
[28] Texas Water Development Board.
[29] Wooldridge, J.: Econometric Analysis of Cross-Section and Panel Data. MIT Press, 2002, pp. 30, 279.


More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007 Syntheszer 1.0 A Varyng Coeffcent Meta Meta-Analytc nalytc Tool Employng Mcrosoft Excel 007.38.17.5 User s Gude Z. Krzan 009 Table of Contents 1. Introducton and Acknowledgments 3. Operatonal Functons

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Supervsed vs. Unsupervsed Learnng Up to now we consdered supervsed learnng scenaro, where we are gven 1. samples 1,, n 2. class labels for all samples 1,, n Ths s also

More information

Associative Based Classification Algorithm For Diabetes Disease Prediction

Associative Based Classification Algorithm For Diabetes Disease Prediction Internatonal Journal of Engneerng Trends and Technology (IJETT) Volume-41 Number-3 - November 016 Assocatve Based Classfcaton Algorthm For Dabetes Dsease Predcton 1 N. Gnana Deepka, Y.surekha, 3 G.Laltha

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments Comparson of Heurstcs for Schedulng Independent Tasks on Heterogeneous Dstrbuted Envronments Hesam Izakan¹, Ath Abraham², Senor Member, IEEE, Václav Snášel³ ¹ Islamc Azad Unversty, Ramsar Branch, Ramsar,

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

EXTENDED BIC CRITERION FOR MODEL SELECTION

EXTENDED BIC CRITERION FOR MODEL SELECTION IDIAP RESEARCH REPORT EXTEDED BIC CRITERIO FOR ODEL SELECTIO Itshak Lapdot Andrew orrs IDIAP-RR-0-4 Dalle olle Insttute for Perceptual Artfcal Intellgence P.O.Box 59 artgny Valas Swtzerland phone +4 7

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

EVALUATION OF THE PERFORMANCES OF ARTIFICIAL BEE COLONY AND INVASIVE WEED OPTIMIZATION ALGORITHMS ON THE MODIFIED BENCHMARK FUNCTIONS

EVALUATION OF THE PERFORMANCES OF ARTIFICIAL BEE COLONY AND INVASIVE WEED OPTIMIZATION ALGORITHMS ON THE MODIFIED BENCHMARK FUNCTIONS Academc Research Internatonal ISS-L: 3-9553, ISS: 3-9944 Vol., o. 3, May 0 EVALUATIO OF THE PERFORMACES OF ARTIFICIAL BEE COLOY AD IVASIVE WEED OPTIMIZATIO ALGORITHMS O THE MODIFIED BECHMARK FUCTIOS Dlay

More information

Data Mining For Multi-Criteria Energy Predictions

Data Mining For Multi-Criteria Energy Predictions Data Mnng For Mult-Crtera Energy Predctons Kashf Gll and Denns Moon Abstract We present a data mnng technque for mult-crtera predctons of wnd energy. A mult-crtera (MC) evolutonary computng method has

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Fuzzy Logic Based RS Image Classification Using Maximum Likelihood and Mahalanobis Distance Classifiers

Fuzzy Logic Based RS Image Classification Using Maximum Likelihood and Mahalanobis Distance Classifiers Research Artcle Internatonal Journal of Current Engneerng and Technology ISSN 77-46 3 INPRESSCO. All Rghts Reserved. Avalable at http://npressco.com/category/jcet Fuzzy Logc Based RS Image Usng Maxmum

More information

Understanding K-Means Non-hierarchical Clustering

Understanding K-Means Non-hierarchical Clustering SUNY Albany - Techncal Report 0- Understandng K-Means Non-herarchcal Clusterng Ian Davdson State Unversty of New York, 1400 Washngton Ave., Albany, 105. DAVIDSON@CS.ALBANY.EDU Abstract The K-means algorthm

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements Module 3: Element Propertes Lecture : Lagrange and Serendpty Elements 5 In last lecture note, the nterpolaton functons are derved on the bass of assumed polynomal from Pascal s trangle for the fled varable.

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

A CLASS OF TRANSFORMED EFFICIENT RATIO ESTIMATORS OF FINITE POPULATION MEAN. Department of Statistics, Islamia College, Peshawar, Pakistan 2

A CLASS OF TRANSFORMED EFFICIENT RATIO ESTIMATORS OF FINITE POPULATION MEAN. Department of Statistics, Islamia College, Peshawar, Pakistan 2 Pa. J. Statst. 5 Vol. 3(4), 353-36 A CLASS OF TRANSFORMED EFFICIENT RATIO ESTIMATORS OF FINITE POPULATION MEAN Sajjad Ahmad Khan, Hameed Al, Sadaf Manzoor and Alamgr Department of Statstcs, Islama College,

More information

Constructing Minimum Connected Dominating Set: Algorithmic approach

Constructing Minimum Connected Dominating Set: Algorithmic approach Constructng Mnmum Connected Domnatng Set: Algorthmc approach G.N. Puroht and Usha Sharma Centre for Mathematcal Scences, Banasthal Unversty, Rajasthan 304022 usha.sharma94@yahoo.com Abstract: Connected

More information

Self-tuning Histograms: Building Histograms Without Looking at Data

Self-tuning Histograms: Building Histograms Without Looking at Data Self-tunng Hstograms: Buldng Hstograms Wthout Lookng at Data Ashraf Aboulnaga Computer Scences Department Unversty of Wsconsn - Madson ashraf@cs.wsc.edu Surajt Chaudhur Mcrosoft Research surajtc@mcrosoft.com

More information

STING : A Statistical Information Grid Approach to Spatial Data Mining

STING : A Statistical Information Grid Approach to Spatial Data Mining STING : A Statstcal Informaton Grd Approach to Spatal Data Mnng We Wang, Jong Yang, and Rchard Muntz Department of Computer Scence Unversty of Calforna, Los Angeles {wewang, jyang, muntz}@cs.ucla.edu February

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

Feature Selection as an Improving Step for Decision Tree Construction

Feature Selection as an Improving Step for Decision Tree Construction 2009 Internatonal Conference on Machne Learnng and Computng IPCSIT vol.3 (2011) (2011) IACSIT Press, Sngapore Feature Selecton as an Improvng Step for Decson Tree Constructon Mahd Esmael 1, Fazekas Gabor

More information

Petri Net Based Software Dependability Engineering

Petri Net Based Software Dependability Engineering Proc. RELECTRONIC 95, Budapest, pp. 181-186; October 1995 Petr Net Based Software Dependablty Engneerng Monka Hener Brandenburg Unversty of Technology Cottbus Computer Scence Insttute Postbox 101344 D-03013

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

K-means and Hierarchical Clustering

K-means and Hierarchical Clustering Note to other teachers and users of these sldes. Andrew would be delghted f you found ths source materal useful n gvng your own lectures. Feel free to use these sldes verbatm, or to modfy them to ft your

More information

Analysis of Malaysian Wind Direction Data Using ORIANA

Analysis of Malaysian Wind Direction Data Using ORIANA Modern Appled Scence March, 29 Analyss of Malaysan Wnd Drecton Data Usng ORIANA St Fatmah Hassan (Correspondng author) Centre for Foundaton Studes n Scence Unversty of Malaya, 63 Kuala Lumpur, Malaysa

More information

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems Determnng Fuzzy Sets for Quanttatve Attrbutes n Data Mnng Problems ATTILA GYENESEI Turku Centre for Computer Scence (TUCS) Unversty of Turku, Department of Computer Scence Lemmnkäsenkatu 4A, FIN-5 Turku

More information

USING GRAPHING SKILLS

USING GRAPHING SKILLS Name: BOLOGY: Date: _ Class: USNG GRAPHNG SKLLS NTRODUCTON: Recorded data can be plotted on a graph. A graph s a pctoral representaton of nformaton recorded n a data table. t s used to show a relatonshp

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information