Should SDBMS Support a Join Index?: A Case study from CrimeStat


Pradeep Mohan, Department of Computer Science, University of Minnesota, mohan@cs.umn.edu
Shashi Shekhar, Department of Computer Science, University of Minnesota, shekhar@cs.umn.edu
Ned Levine, Ned Levine and Associates, Houston, TX, ned@nedlevine.com
Ronald E. Wilson, National Institute of Justice, Washington D.C., Ronald.Wilson@usdoj.gov
Betsy George, Department of Computer Science, University of Minnesota, bgeorge@cs.umn.edu
Mete Celik, Department of Computer Science, University of Minnesota, mcelik@cs.umn.edu

ABSTRACT

Given a spatial crime data warehouse that is updated infrequently, a set of operations O, and constraints on storage and update overheads, the index type selection problem is to find a set of index types that can reduce the I/O cost of the set of operations. The index type selection problem is important to improve user experience and system resource utilization in crucial spatial statistics application domains such as mapping and analysis for public safety, public health, ecology, and transportation. This is because the response time of frequent queries based on the set of operations can be improved significantly by an effective choice of index types. Many spatial statistical queries in these application domains make use of a spatial neighborhood matrix, known as W in spatial statistics, which can be thought of as a spatial self-join in spatial database terminology. Currently supported index types such as the B-Tree and R-Tree families do not adequately support spatial statistical analysis because they require on-the-fly computation of the W-Matrix, slowing down spatial statistical analysis. In contrast, this paper argues that Spatial Database Management Systems (SDBMS) should support a join index to materialize the W-Matrix and eliminate on-the-fly computation of the common self-join. A detailed case study using the popular spatial statistical software package for public safety, namely CrimeStat, shows that join indices can significantly speed up spatial analysis such as calculation of Ripley's K and identification of hotspots.
Categories and Subject Descriptors
H.2.2 [PHYSICAL DESIGN]: Access methods

General Terms
Design, Experimentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM GIS '08, November 5-7, 2008. Irvine, CA, USA (c) 2008 ACM ISBN /8/...$5.

Keywords
Join Index, Spatial Statistics, W Matrix, Self-Join.

1. INTRODUCTION

Given a spatial crime data warehouse that is updated infrequently and a set of operations, the spatial index type selection problem is to find a set of spatial index types that can reduce the I/O cost of the set of operations under given constraints of storage and update overheads. The index type selection problem is important to improve user experience, response time, and system resource utilization. For example, in tools such as CrimeStat [6], the response time for identification of hotspots is 2 hours for a dataset size of 5 crime reports. This slow response time occurs because CrimeStat is a main memory tool. Using spatial index types, e.g., a join index family, may lower the response time to a few minutes, thereby enhancing the user experience.

Figure 1: Identification of Hotspots

This paper focuses on spatial statistical queries in the context of mapping and analysis for public safety. The application considers questions such as, "Are there spatial concentrations of crime that warrant increased police targeting at the community, city, and county levels?" to identify a set of spatially grouped instances defined as hotspots. For example, Figure 1 illustrates the identification of burglary hotspots in applications such as mapping and analysis of public safety. In these scenarios, law enforcement agencies normally have very limited resources, such as officers and patrol vehicles, to deploy in a concentrated manner. Given a large area such as a city, it would be very useful for law enforcement agencies to examine different possible configurations for distributing their limited resources to areas where there is increased crime activity. To perform this strategic placement, crime analysts and law enforcement agencies perform an exploratory analysis.

Similar spatial statistical queries are important in many other application domains such as public health, transportation, ecology, and consumer applications. For example, public health authorities may be interested in hotspots of diseases such as cancer clusters [4] in order to identify and remedy environmental factors such as contaminated soil or water. Transportation professionals may be interested in identifying and remedying spatial concentrations of traffic accidents by re-designing transportation networks, for instance via traffic calming. Ecologists may identify spatial concentrations of endangered species to promote their protection. Many of these queries make use of a spatial neighborhood matrix, known as W in spatial statistics, and perform repeated W-Matrix computation for different neighborhoods. We call such queries W-Queries.

Current spatial database management systems (SDBMS) provide a rich set of operations and spatial index structures, such as the B-Tree and R-Tree index families, that can enhance the efficiency of processing queries in various applications [, 3, 4,,, 2, 7]. However, these SDBMS must perform repeated on-the-fly computation of the W-Matrix and are limited in their ability to support W-Query operations. This paper argues that SDBMS should support a self-join index. The paper aims to establish the utility of the join index for processing W-Queries efficiently by evaluating the idea of a self-join index.
Related Work: Research related to the index type selection problem can be classified into two categories: (1) spatial indices that make use of on-the-fly join computation strategies by computing joins as a part of the query evaluation process, and (2) spatial indices used for direct (lookup) join computation that compute joins by performing a sequence of lookups.

On-the-fly join computation techniques that are based on spatial indices, namely the R-Tree and its variants, are suitable for computing the spatial join for a single neighborhood relationship [, 3, 4, 5, 6, 7, 8,,, 7, 8, 9]. However, W-Queries are exploratory in nature and require repeated self-join computation, making on-the-fly join computation expensive. Spatial indices such as the R-Tree and its variants, Quad tree, Grid Files, etc., have been incorporated as a part of commercial SDBMS systems [7, 8, 9, 9, 6]. IBM Informix Spatial DataBlade makes use of the R-Tree, ESRI ArcSDE makes use of Grid Files, Oracle Spatial makes use of the R-Tree and Quad tree, and Microsoft SQL Server 2008 spatial data support makes use of multi-level Grid files. These commercial SDBMS tools retain the limitations of the corresponding spatial indices and hence do not provide support for W-Queries and their operations. A major issue faced by existing SDBMS tools in supporting several spatial indices is that the choice of a spatial index type for a given set of workloads affects the strategy for I/O optimization, query optimization strategies, concurrency control, and recovery strategies.

Direct join computation techniques are based on spatial indices such as the (spatial) join index [8,3]. Join indices have been primarily used in the context of computing a spatial join between two different relations to speed up online query processing in infrequently updated databases. However, current join indices are represented as bi-partite graphs [9, 4]. By contrast, W-Queries are primarily focused on computing several spatial self-join operations. In self-join cases, the join index becomes a neighborhood graph rather than a bi-partite graph representation.
Hence, the current representation of join indices as bi-partite graphs needs re-consideration.

Our Contributions: First, we characterize the computational structure of W-Queries. We consider the computation of Ripley's K Function and the identification of hotspots [2,6] for modeling W-Queries. We propose a set of operations for handling these queries. We define the spatial index selection problem for handling the set of operations efficiently. We propose two variants of the self-join index, namely (a) the Self-join Edge List Index (SJELI) and (b) the Self-join Adjacency List Index (SJALI). We also propose algorithms for processing W-Queries. We evaluate the I/O efficiency of the proposed variants of the self-join index using algebraic cost models for the operations. The cost model and the experimental results establish the utility of the self-join indices. Experimental results using real crime datasets indicate that the self-join indices decrease the user response time of W-Queries by a factor 4 compared to a single threaded version of CrimeStat and outperform an R-Tree based Tree Matching self-join algorithm. Based on these findings we believe that existing SDBMS should adopt the self-join indices to support spatial statistical queries such as W-Queries.

Scope: This paper primarily focuses on the selection of a suitable index type for a given set of operations. The join indices are materialized for the study area, a primary requirement in most spatial statistical analysis applications. Our propositions are mainly focused on multiple spatial neighborhood analysis queries within a particular study area. The aim is to reduce the response time of the proposed set of operations for W-Queries in a spatial crime data warehouse setting. We understand that adding new index types in an SDBMS is a complex decision due to the impact on issues such as concurrency control, recovery, and evaluation of storage costs. These issues are beyond the current scope of the paper.

Outline: The rest of the paper is organized as follows. Section 2 presents the basic concepts and the spatial index type selection problem.
Section 3 describes the proposed self-join index variations and design decisions. In Section 4, we propose two algorithms for two example W-Queries, namely Ripley's K Function computation and identification of hotspots, and propose an algebraic cost model for the set of W-Query operations. The experimental evaluation is given in Section 5, and Section 6 outlines the conclusions and future work.

2. Basic Concepts and Problem Statement

In this section, we present some basic concepts required to model W-Queries. We model W-Queries by identifying two example queries, namely the computation of the Ripley K-Function and the identification of hotspots. We propose a set of W-Query operations based on the example queries and characterize their computational structure.

Figure 2: Sample dataset and the W-Matrix for different relations. (a) Neighborhood graph for neighborhood relation R1. (b) Neighborhood graph for relation R2. (c) W-Matrix for relation R1. (d) W-Matrix for relation R2.

In spatial statistics, the W-Matrix is a matrix-based representation of space and a measure of the adjacency, proximity, distance or level of spatial interaction between spatial instances [3]. Given a uniform spatial framework and a set of spatial instances, W-Queries re-compute the W-Matrix for different neighborhood relations. For example, Figures 2a and 2b represent the spatial neighborhood graph for a spatial dataset. Figure 2a corresponds to a neighborhood relation R1, and Figure 2b corresponds to a neighborhood relation R2. The corresponding W-Matrices for the neighborhood graphs are illustrated in Figures 2c and 2d respectively. Spatial instances are represented by the N-labeled nodes in a uniform spatial framework. In the W-Matrix, a 1 denotes that the two spatial instances satisfy the neighborhood relation and a 0 denotes that the two spatial instances do not satisfy the neighborhood relation.

Definition 2.1 Given two spatial instances S_i and S_j, where i ≠ j, in a spatial dataset S_D, a neighborhood relation R(S_i, S_j) can be defined as a measure of spatial interaction, distance or adjacency. For example, in Figure 2, R1 and R2 are two different spatial neighborhood relations.

Definition 2.2 Given a spatial framework S, the W-Matrix is defined as a set of values that quantify the spatial interaction, distance or adjacency. These values can be binary or real depending on the measure of spatial interaction used. Formally, the W-Matrix can be defined as follows [3]:

$$W(S_D, R) = \{\, R(S_i, S_j) \mid S_i, S_j \in S_D,\ R(S_i, S_j) \text{ is valid, and } i \neq j \,\}$$

Definition 2.3 Given a spatial instance S_i, the no_of_instances(S_i, R) of instance S_i is the number of spatial instances S_j ∈ S_D, i ≠ j, that satisfy the neighbor relation R.
For example, in Figure 2, R1 and R2 are two different spatial neighborhood relationships, for which example instances in the figure have no_of_instances(·, R1) = 3 and no_of_instances(·, R2) = 4.

Definition 2.4 Given a spatial instance S_i, the average edge weight (AEW) (or average weight) of the spatial instance is the sum of the values of R(S_i, S_j) divided by Frequency(S_i, R), taken over the instances S_j ∈ S_D, i ≠ j, that have a valid neighbor relation R with S_i. The term average edge weight is relevant only if the neighbor relation represents a value of distance or similarity.

2.1 Two Simple W Queries

Figure 3: Computational structure of W-Queries. (a) Ripley's K. (b) Hotspots.

To model W-Queries, we consider two spatial statistical queries that have been applied to compute statistics in CrimeStat [6].

Query I: Is data spatially clustered? Query I relates to the calculation of a well-known statistical measure called Ripley's K function [2, 4]. This measure calculates the cumulative number of spatial instances that are within a search radius of each spatial instance in the dataset. This cumulative count is computed for different neighborhood radii. Figure 3(a) illustrates the method of computing Ripley's K Function. In the figure, dark circles around the spatial instances represent neighborhood relationship R1, and dashed circles around the spatial instances represent neighborhood relationship R2. The Ripley K Function method computes the number of spatial instances around a particular spatial instance for a particular neighbor relation R2 and reports the cumulative sum of these frequencies over all spatial instances. The process is repeated after the neighbor relation is changed to R1, and so on until a significant number of levels are completed. Spatial statistics tools such as CrimeStat [6] evaluate multiple such neighborhood relationships.

Query II: Are there concentrations of crime that warrant increased police targeting at the community, city, and county level? Query II relates to the identification of a spatially grouped set of instances defined as hotspots.
Figure 3(b) illustrates hotspots that can be extracted from the spatial dataset for multiple neighborhood definitions. The N-labeled nodes are the spatial instances. In the figure, dark ellipses refer to hotspots that are identified for a neighborhood R1 and the dashed ellipse refers to hotspots that are identified for a neighborhood R2. The computational process begins with the computation of the W-Matrix for an initial neighborhood relation R1 and the selection of a set of representative points called seeds. Seeds are defined as spatial instances which have a minimal average edge weight compared to their neighbor spatial instances. For example, in Figure 3(b), the seed points are the instances that have the minimum average edge weights for the neighbor relation R1. The hotspot identification process always maintains a list of potential seeds that is updated whenever a new hotspot is identified. The key challenge in the process is to identify non-overlapping hotspots so that spatial instances are not reconsidered in subsequent hotspots.
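Definitions 2.2 and 2.3 and the cumulative count behind Query I can be illustrated with a small on-the-fly computation. The following is a minimal sketch, assuming a binary distance-threshold neighbor relation on 2D points; the function names `w_matrix` and `cumulative_count` and the sample coordinates are illustrative assumptions, not part of the paper or of CrimeStat.

```python
from math import dist

def w_matrix(points, radius):
    """Binary W-Matrix: entry [i][j] is 1 when i != j and the two
    points lie within `radius` of each other (assumed relation R)."""
    n = len(points)
    return [[1 if i != j and dist(points[i], points[j]) <= radius else 0
             for j in range(n)]
            for i in range(n)]

def no_of_instances(w, i):
    """Definition 2.3: number of instances satisfying R with instance i."""
    return sum(w[i])

def cumulative_count(points, radius):
    """Cumulative neighbor count over all instances for one radius.
    The W-Matrix is recomputed from scratch on every call."""
    w = w_matrix(points, radius)
    return sum(no_of_instances(w, i) for i in range(len(points)))

# Hypothetical sample data: three nearby points and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
counts = [cumulative_count(points, r) for r in (1.0, 1.5, 10.0)]
print(counts)
```

Each radius triggers a full quadratic recomputation of the W-Matrix here, which is the repeated on-the-fly cost that a materialized self-join index is meant to avoid.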

Table 1: W-Queries from CrimeStat [6]

Statistic | W(S_D, R) | Consecutive W Subsets | Frequency Based | Average Edge Weight Based | Join Computation: On the Fly | Join Computation: Look up
Ripley's K Function | Yes | Yes | Yes | No | No | Yes
Nearest Neighbor Statistic | Yes | Yes | Yes | No | No | Yes
Hotspots | Yes | Yes | Yes | Yes | No | Yes
Moran's I | No | No | No | No | Yes | Yes
Geary's C | No | No | No | No | Yes | Yes
Local Moran (LISA) | Yes | No | No | No | No | Yes

2.2 Case Study: W Queries from CrimeStat

Spatial statistical queries that can be classified as W-Queries, and that mainly involve repeated computation of neighborhood relationships, are drawn from crime analysis tools such as CrimeStat [6]. Table 1 lists some of these queries. CrimeStat has several spatial autocorrelation routines including Moran's I, Geary's C and LISA. These are global level statistics that determine if there is clustering or dispersion within a dataset across a study area. They are used as a guide to conduct local level hotspot analysis: if the results indicate there is no clustering or dispersion, then any hotspots found with local level techniques will likely be false positives. These spatial statistical measures can also be modeled as W-Queries.

2.3 Operations for W-Queries

W-Queries can be modeled as a set of operations that can be used to identify a suitable spatial index type to process them efficiently. Figure 4 illustrates the effect of the set of operations on the example dataset. Since the spatial dataset is modeled as a neighborhood graph under a neighborhood relation, we make use of terminology used in the spatial network database literature, such as predecessor and successor [7]. We make use of node coloring to distinguish a predecessor from a successor as the operations are applied on a neighborhood graph.

get-neighbors-in-relationship(S_i, R): Identify the neighbors of a spatial instance S_i. Given the spatial instance S_i, the get-neighbors-in-relationship() operation colors the spatial instance S_i and gives all the neighbors that satisfy the relationship R the same color as S_i.
For example, Figure 4(a) shows the effect of get-neighbors-in-relationship(S_i, R) on one spatial instance of the sample dataset: the operation colors that instance and its neighbors.

get-successors(S_i): Retrieve the successors of S_i. The successors of a spatial instance S_i are defined as the set of spatial instances that satisfy the neighbor relation R with S_i and have the same color. For example, Figure 4(b) shows the effect of the get-successors(S_i) operation on a spatial instance, where the instances that have the same color as that instance are reported as successors.

get-successor(S_i): Retrieve the farthest unreported successor of S_i. This operation returns the spatial instance which is a successor of S_i and has the maximum value of the neighbor relation R with S_i. We call this the "farthest successor first" strategy. For example, Figure 4(c) shows the effect of the get-successor(S_i) operation on a spatial instance of the sample dataset.

get-predecessors(S_i): Retrieve the predecessors of S_i, that is, the spatial instances that have a color different from that of spatial instance S_i. This operation is normally executed when the degree of spatial instances requires updating. For example, Figure 4(f) shows the result of get-predecessors(S_i) on a spatial instance; the operation reports two instances as the results.

get-predecessor-of-successor(S_i): Retrieve a predecessor of the successor of S_i. This operation returns the nearest uncolored spatial instance to the successor of S_i. A predecessor is a spatial instance S_j that does not have the same color as spatial instance S_i. For example, Figure 4(d) shows the result of get-predecessor-of-successor(S_i) applied two times on a spatial instance; the operation reports two instances as the results.

get-predecessors-of-successor(S_i): Retrieve the predecessors of the successors of S_i. This operation retrieves the predecessors of the successor of a spatial instance S_i.
This operation is important for updating the average edge weight of the spatial instances that neighbor the neighbors of S_i. For example, Figure 4(g) shows the result of this operation on a spatial instance, where the first predecessor of its first successor is reported.

update-successors(S_i, <successors>): Un-color all the successors of S_i. The operation checks whether each spatial instance is colored; if it is colored, then it un-colors the spatial instance. <successors> represents a list of successors to be updated.

Figure 4: Effect of W-Query operations on the sample dataset. (a) get-neighbors-in-relationship(). (b) get-successors(). (c) get-successor(). (d) get-predecessor-of-successor() applied two times. (e) update-successors(). (f) get-predecessors(). (g) get-predecessors-of-successor(). (h) get-predecessor(). Legend: colored spatial instances, successors, predecessors, unmarked instances, instances that require an update, relations R1 and R2.

For example, Figure 4(e) shows the result of update-successors(S_i) on two spatial instances.

update-average-edge-weight(S_i): Update the average edge weight of a spatial instance. This operation updates (reduces) the average edge weight of a given spatial instance S_i. For example, this operation is applied on the instances shown in Figure 4(f, g); one of them is updated two times in this example.

2.4 Problem Statement

This section defines the spatial index type selection problem given a set of operations that are relevant to W-Queries.

Given:
A spatial crime data warehouse
A set of operations O

Find: A suitable secondary memory index structure type.

Objective: To minimize the I/O cost of the set of operations O.

Constraints:
Spatial datasets are updated infrequently.
Concurrency control and recovery considerations are addressed separately.
There are no storage overheads.
User response time is minimized.

Example: Consider computing a W-Query such as the Ripley K Function, given a spatial dataset and a set of operations, namely get-neighbors-in-relationship() and get-successors(). The objective of the above problem is to find a suitable spatial index type that minimizes the I/O cost of the operations get-neighbors-in-relationship() and get-successors(), and the user response time of the W-Query. Different W-Queries may have different workloads, which are provided as an input to the query. For example, Ripley's K has parameters such as maximum neighborhood size and number of spatial neighborhoods.

3. Self-Join Index and Its Variants

In this section, we formally define a self-join index (SJI) and propose two variants, namely the Self-join edge list index (SJELI) and the Self-join adjacency list index (SJALI). We formally define the self-join index as:

$$SJI = \{\, \langle S_i, S_j, R(S_i, S_j) \rangle \mid S_i, S_j \in S_D,\ R \in R_S,\ R(S_i, S_j) \text{ is valid} \,\}$$

where S_D is the spatial dataset and R_S is a set of neighborhood relationships that are defined for a spatial framework S. For example, from Figure 5, R_S = {R1, R2}, and R(S_i, S_j) is either R1 or R2.

3.1 Representations of the SJI

Traditionally, the join index has been represented as a bi-partite graph. Since W-Queries repeatedly compute self-joins, the modeling of the self-join index as a bi-partite graph needs to be modified to that of an undirected neighborhood graph, G = (S_D, E). The neighborhood graph G consists of a set of spatial instances S_D and an edge set E. Each element S_i ∈ S_D is a spatial location in a uniform spatial framework S. The set of edges E is a subset of the cross product S_D × S_D. Each element (S_i, S_j) in E is an edge that joins instances S_i and S_j, where i ≠ j. Also, each edge has a weight, which is the level of spatial interaction, distance or adjacency.
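The neighborhood graph G = (S_D, E) can be materialized once for the largest neighborhood of interest and stored in either of the two forms discussed in this section. The sketch below is one possible in-memory illustration, assuming Euclidean distance as the edge weight; the function name `build_self_join_index`, the tuple layouts, and the sort orders are assumptions made for the example rather than the paper's storage format.

```python
from collections import defaultdict
from math import dist

def build_self_join_index(points, max_radius):
    """Materialize the self-join of `points` with itself for all pairs
    within `max_radius`. Returns (edge_list, adjacency_list)."""
    edge_list = []                 # entries: (i, j, weight) with i < j
    adjacency = defaultdict(list)  # i -> [(weight, j), ...]
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # undirected graph: visit each pair once
            d = dist(points[i], points[j])
            if d <= max_radius:
                edge_list.append((i, j, d))
                adjacency[i].append((d, j))
                adjacency[j].append((d, i))
    # Assumed ordering: by first instance, then by relation value.
    edge_list.sort(key=lambda e: (e[0], e[2]))
    for neighbours in adjacency.values():
        neighbours.sort()          # nearest neighbour first
    return edge_list, dict(adjacency)

# Hypothetical sample data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
edge_list, adjacency_list = build_self_join_index(points, max_radius=2.0)
print(edge_list)
print(adjacency_list)
```

With the edge list, finding the neighbors of one instance requires scanning entries in which it appears on either side, whereas the adjacency list keeps them together under a single key.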

Figure 5: Self-join index representations. (a) Neighborhood graph for relation R1. (b) Neighborhood graph for relation R2. (c) Self-join edge list index (SJELI). (d) Self-join adjacency list index (SJALI).

This neighborhood graph can be represented in two different ways, namely the edge list and the adjacency list. Figures 5(a) and 5(b) are the neighborhood graphs for the relations R1 and R2 respectively. We present the design of the two representations and evaluate the effect of the operations on the two variants.

3.1.1 Self Join Index: Edge List Representation (SJELI)

The edge list representation of the self-join index is illustrated in Figure 5(c). In this representation, the join index is ordered by column and, within a column, by the value of the relation R(S_i, S_j). As is evident from Figure 5(c), this representation does not provide any information on the successors or the predecessors of a spatial instance S_i. A clear challenge with this representation is to determine an optimal partitioning of the SJELI that minimizes the I/O costs of the set of operations.

3.1.2 Self Join Index: Adjacency List Representation (SJALI)

The adjacency list representation of the self-join index is illustrated in Figure 5(d). The adjacency list representation has clear advantages compared to the edge list representation. First, the adjacency list representation maintains a list of successors and predecessors that are critical for processing W-Queries. Second, the coloring scheme used by the set of operations can easily exploit the adjacency list representation to retrieve the successors or predecessors with less I/O. For the same reasons, processing updates on the adjacency list is also easier.

3.1.3 Design Issues

We make use of the connectivity clustering heuristic [7] to cluster the spatial instances of the SJALI and SJELI. CCAM (Connectivity-Clustered Access Method) [7] makes use of separate lists for successors and predecessors and does not exploit the concept of a spatial neighborhood. The self-join indices, SJALI and SJELI, are primarily neighborhood graphs that are represented as adjacency lists and edge lists.
We apply the connectivity clustering heuristic to the two neighborhood graphs to store them in disk pages. In the design of the SJALI, we maintain only one list of adjacent neighbors of a particular spatial instance. The proposed set of W-Query operations, for example get-neighbors-in-relationship(S_i, R), makes use of a coloring heuristic to retrieve the successors and the predecessors of a particular spatial instance. To allocate these spatial instances to disk pages, we make use of the same connectivity clustering heuristic on the neighborhood graph. For example, in Figure 4(d), a typical page allocation would store one group of neighboring instances in the same page, a second group in another page, and the remaining instances in a separate page. This allocation scheme changes with the maximum size of a page and the value of the Connectivity Residue Ratio (CRR) [7]. CRR is defined as the probability that two neighboring spatial instances are present in the same disk page.

Utilizing the same heuristic on the SJELI involves storing the edge lists of spatial instances in the same disk page such that the number of cut edges is minimized. This allocates the edge lists of spatial instances to pages, where each edge of a spatial instance corresponds to a page entry. In some cases, for large neighborhood sizes, it is possible that the edge list of one spatial instance may itself exceed a single page. For example, in Figure 5(c), a typical page allocation would allocate the edge lists of one group of neighboring instances to the same page, the edge lists of a second group to another page, and the remaining edge lists to a separate page.
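The CRR of a given page allocation can be estimated directly from its definition as the fraction of neighborhood-graph edges whose two endpoints fall in the same page. The following is a minimal sketch under that reading; the function name, the toy graph, and the page assignment are hypothetical, and returning 0.0 for an empty edge set is a convention chosen for the example.

```python
def connectivity_residue_ratio(edges, page_of):
    """Fraction of edges (a, b) whose endpoints share a disk page.
    `page_of` maps each spatial instance to its page identifier."""
    if not edges:
        return 0.0  # assumed convention for a graph with no edges
    uncut = sum(1 for a, b in edges if page_of[a] == page_of[b])
    return uncut / len(edges)

# Hypothetical neighborhood graph with five instances and five edges.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 3)]
# Hypothetical allocation: instances 1-3 on page 0, instances 4-5 on page 1.
page_of = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}

crr = connectivity_residue_ratio(edges, page_of)
print(crr)  # only edge (3, 4) is cut by the page boundary
```

A clustering heuristic that keeps more edges within a page raises this ratio, which in turn lowers the (1 - CRR) factor that appears in the I/O cost of each lookup.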

The key trade-off between the two representations is in the value of the connectivity residue ratio (CRR) they yield. The SJELI would yield a lower value of CRR for small page sizes, thus resulting in larger I/O costs. The SJELI would also incur more I/O costs for larger neighborhood sizes than the other representation. This indicates that the value of the CRR for both the SJELI and the SJALI depends on the value of the neighborhood relation R. An in-depth evaluation of the variation in CRR for the two self-join indices is beyond the scope of this paper.

4. W-Query Processing Algorithms

In this section, we propose two query processing algorithms using the set of operations get-neighbors-in-relationship(), get-successors(), get-predecessors(), get-successor(), get-predecessor(), get-predecessor-of-successor(), get-predecessors-of-successor(), update-average-edge-weight(), and update-successors(). These operations are used to design the algorithms for W-Queries, namely Ripley's K-Function computation and identification of hotspots.

4.1 Ripley's K Function Computation

The Ripley K Function computation involves the use of two operations, get-neighbors-in-relationship(S_i, R) and get-successors(S_i). Algorithm 1 lists the computational process for the Ripley K Function. The trace of the algorithm is listed in Table 2.

Algorithm 1: CalcRipleyK: Computation process for computing Ripley's K Function
Inputs: Spatial dataset S_D, Query: Is data spatially clustered?, Total number of levels, Study Area
Output: K Function: Measure of spatial randomness.
Procedure: CalcRipleyK
do
begin
    for every spatial instance S_i in S_D
        get-neighbors-in-relationship(S_i, R[j])
        F[j] := F[j] + size(get-successors(S_i, R[j]))
        update-successors(S_i)
    endfor
    K[j] := calculate_ripley_k from F[j]
    j := j + 1
    R[j] := decrease_neighborhood(R[j-1])
end
while (j <= Total Number of Levels)

The trace of the Hotspot_JI Algorithm is listed in Table 3. The trace clearly shows that the number of hotspots computed decreases as the size of the neighborhood increases.
Also, the effect of the set of operations is listed in the trace.

Table 2: Trace of CalcRipleyK Algorithm (columns: Neighbor Relation; get-neighbors-in-relationship(S_i, R); get-successors(S_i); Frequency). The frequencies listed for relation R2 total 28.

4.2 Identification of Hot Spots

The identification of hotspots involves the use of the operations get-neighbors-in-relationship(S_i, R), get-successors(S_i, R), get-successor(S_i), update-successors(S_i), get-predecessors(S_i), and update-average-edge-weight(S_i). Algorithm 2, Hotspot_JI, lists the computational process for the identification of hotspots.

Algorithm 2: Hotspot_JI: Computation process for extracting hotspots from a spatial dataset.
Inputs: Spatial Dataset S_D, Query: Are there concentrations of crime that warrant increased police targeting at the block, city and county level?, HotspotSizeThreshold, Set of Neighbor Relations
Output: Set of hotspots corresponding to each neighbor relation
Procedure: Hotspot_JI
while (Size(HotspotQueue) >= HotspotSizeThreshold)
begin
    while (Terminate when there are no more seeds)
        S := Retrieve New Seed
        get-neighbors-in-relationship(S, R)
        Successor_List := get-successors(S)
        while (R[i](predecessor-of-successor(S)) < R[i](get-successor(S)))
            upd_succ_list.enque(Successor_List.Deque())
        endwhile
        update-successors(S, upd_succ_list)
        HotspotQueue := Successor_List
        while (Successor_List != NULL)
            p := get-predecessor(Successor_List.Deque())
            update-average-edge-weight(p)
        endwhile
    i := i + 1
    R[i] := increase_neighborhood R[i-1]
end
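The frequency accumulation of Algorithm 1 (CalcRipleyK) can be approximated with plain lookups once a self-join index is materialized. The sketch below is an illustration only: it assumes an adjacency list keyed by instance with (distance, neighbor) entries sorted by distance, replaces the coloring operations with a binary search, and stops at the cumulative frequency F because the normalization applied by calculate_ripley_k is not spelled out in the text. The name `frequency_per_level` and the sample index are hypothetical.

```python
from bisect import bisect_right

def frequency_per_level(adjacency, radii):
    """For each radius (one level per neighbor relation), return the
    cumulative number of neighbors within that radius over all instances.
    `adjacency` maps instance -> list of (distance, neighbor), sorted by
    distance, as produced by a materialized self-join index."""
    # Extract the sorted distance arrays once; every level reuses them.
    sorted_distances = {s: [d for d, _ in nbrs] for s, nbrs in adjacency.items()}
    frequencies = []
    for r in radii:
        # bisect_right counts the entries with distance <= r.
        frequencies.append(sum(bisect_right(ds, r)
                               for ds in sorted_distances.values()))
    return frequencies

# Hypothetical materialized index for three instances.
adjacency = {
    0: [(1.0, 1), (1.0, 2)],
    1: [(1.0, 0), (1.5, 2)],
    2: [(1.0, 0), (1.5, 1)],
}
# Radii decrease from level to level, as in decrease_neighborhood().
F = frequency_per_level(adjacency, radii=[2.0, 1.0])
print(F)
```

No pairwise distance is recomputed when the radius changes; each level costs one lookup per instance, which is the behavior the look-up column of Table 1 refers to.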

Table 3: Trace of Hotspot_JI Algorithm for identifying hotspots from the sample dataset (columns: Neighbor Relation; Hotspots; Seeds; get-successors(S); get-successor(S); get-predecessor-of-successor(S); update-successors; get-predecessors(S); update-average-edge-weight; rows are given for relations R1 and R2).

4.3 Algebraic Cost Model

In this section, we provide algebraic cost models for the I/O costs of W-Query operations. We make use of the CRR to measure the worst case I/O costs of the operations. Table 4 lists the symbols used to develop the cost formulas.

Table 4: Symbols used in Cost Analysis.

Symbol | Meaning
|S| | Average number of successors of a particular node
|P| | Average number of predecessors of a particular node
CRR | Connectivity residue ratio: the probability that page(S_i) = page(S_j) for edge (S_i, S_j)
|S_R| | The average number of instances satisfying the neighbor relation R
|S_D| | The total size of the spatial dataset
ρ | Selectivity of a range query for a neighbor relation R, {|S_R| / (|S_D| - 1)} × |S_D|
Z_LI = Z | Cost of accessing a single spatial instance from the SJALI
Z_EL = Z | Cost of accessing a single spatial instance from the SJELI

For both self-join index variants, let the cost of retrieving one spatial instance be Z. The value of Z is equal to 1, which is the cost of a simple look-up from the join indices. As described earlier, the CRR of the SJELI is expected to be lower than that of the SJALI due to the presence of a large number of cut edges on a single page. Hence, the I/O costs of the W-Query operations are expected to be greater for the SJELI.

The get-neighbors-in-relationship(S_i, R) operation retrieves all the instances that satisfy the neighborhood relationship R with S_i. The cost of one get-neighbors-in-relationship operation equals the cost of retrieving the neighbors of S_i multiplied by the probability that the neighbors are not in the same disk page. The get-successors(S_i) operation retrieves all the successors of S_i.
The cost of one get-successors() operation involves the cost of retrieving all the successors and the probability that the successors are not in the same page as S_i. The get-predecessors(S_i) operation retrieves all the predecessors of S_i. The cost of one get-predecessors() operation involves the cost of retrieving all the predecessors of S_i and the probability that they are not in the same page as S_i. The cost of one get-successor(S_i) operation is the probability that the successor of S_i is not in the same page as S_i. The cost of one get-predecessor(S_i) operation is the same. The cost of one get-predecessors-of-successor(S_i) operation involves the cost of extracting one successor and then the cost of extracting the predecessors of that successor, accounting for the probability that they are not in the same disk page. The cost of one update-successors(S_i) operation is the cost of un-coloring the successors of S_i, which is the cost of retrieving the successors multiplied by the probability that they are not in the same page. The cost of one update-average-edge-weight(S_i) operation is the cost of retrieving S_i and also moving S_i to an appropriate secondary memory bucket that maintains potential seeds for handling W-Queries such as identification of hotspots. These costs are summarized in Table 5.

Table 5: Worst case I/O cost analysis of W-Query operations.

Operation | Data Page Accesses
get-neighbors-in-relationship(S_i, R) | {|S_R| / (|S_D| - 1)} |S_D| Z (1 - CRR) = ρ Z |S_D| (1 - CRR)
get-successors(S_i) | |S| Z (1 - CRR)
get-successor(S_i) | Z (1 - CRR)
get-predecessor-of-successor(S_i) | 2 Z (1 - CRR)
update-successors(S_i) | Z (1 - CRR) × |S|
get-predecessors(S_i) | |P| Z (1 - CRR)
get-predecessors-of-successor(S_i) | (|P| Z + 1) (1 - CRR)
get-predecessor(S_i) | Z (1 - CRR)
update-average-edge-weight(S_i) | 2 Z

5. Experimental Evaluation

The self-join indices were evaluated using a set of experiments that measure the response time of the two queries, namely Ripley's K Function and hotspots. The experiments were implemented in C++/CLI and conducted on a Pentium Xeon 3.2 GHz machine with 4GB of main memory.
We make use of real crime datasets to demonstrate the utility of the self-join index variants for processing W-Queries and their set of operations efficiently. We measured the user response time for the queries.
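As a side illustration of the cost model in Section 4.3, the worst-case expressions of Table 5 can be evaluated numerically for a given set of parameters. The Python sketch below is not part of the implementation evaluated here (which was written in C++/CLI). The names `CostParams` and `w_query_costs` are hypothetical, and the first cost is computed from the left-hand expression of Table 5 using $|S_R|$ and $|S_D|$ directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    """Inputs named after the symbols of Table 4 (field names are hypothetical)."""
    avg_successors: float         # |S|
    avg_predecessors: float       # |P|
    crr: float                    # CRR, a probability in [0, 1]
    neighbors_in_relation: float  # |S_R|
    dataset_size: int             # |S_D|
    z: float = 1.0                # Z, cost of one look-up in the join index


def w_query_costs(p: CostParams) -> dict[str, float]:
    """Return worst-case data page accesses per operation, following Table 5."""
    if not 0.0 <= p.crr <= 1.0:
        raise ValueError("CRR must lie in [0, 1]")
    if p.dataset_size < 2:
        raise ValueError("dataset size must be at least 2")

    # Probability that two connected instances are NOT on the same page.
    miss = 1.0 - p.crr
    return {
        "get-neighbors-in-relationship":
            (p.neighbors_in_relation / (p.dataset_size - 1))
            * p.dataset_size * p.z * miss,
        "get-successors": p.avg_successors * p.z * miss,
        "get-successor": p.z * miss,
        "get-predecessor-of-successor": 2 * p.z * miss,
        "update-successors": p.z * miss * p.avg_successors,
        "get-predecessors": p.avg_predecessors * p.z * miss,
        "get-predecessors-of-successor": (p.avg_predecessors * p.z + 1) * miss,
        "get-predecessor": p.z * miss,
        "update-average-edge-weight": 2 * p.z,
    }


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the experiments.
    example = CostParams(avg_successors=4, avg_predecessors=4, crr=0.5,
                         neighbors_in_relation=4, dataset_size=101)
    for operation, cost in w_query_costs(example).items():
        print(f"{operation}: {cost:.2f}")
```

Because every expression except update-average-edge-weight is scaled by (1 - CRR), a variant with a lower CRR, as expected for SJELI, gives proportionally higher costs under this model.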

We compared our proposed self-join index-based direct join computation method with an R-Tree-based tree matching self-join computation method that computes the W-Matrix for every new neighborhood relationship. We performed experiments for different dataset sizes ranging from 82 spatial instances to 4852 spatial instances. We also compared the response time of the self-join index-based algorithms with that of the ones implemented in a modularized single-threaded version of CrimeStat. The experimental evaluation addresses the following questions:

Question 1: What is the user response time of the Ripley's K function query?

We implemented the W-Query processing algorithm CalcRipleyK, proposed in Section 4, on a self-join adjacency list index (SJALI). We also implemented the same queries by repeated computation of self-joins on the R-Tree index. Figure 6 shows the comparison of the R-Tree-based on-the-fly join computation method and the method using the self-join index. The total response time also includes the time for performing I/O. It can be concluded from Figure 6 that the self-join index-based implementation performs better than the R-Tree-based on-the-fly join computation. We have omitted the details of the algorithm for space considerations. This algorithm involves a repeated computation of only the self-join operation. The algorithm was executed for a set of neighborhood relationships.

Figure 6. User response time comparison for Ripley's K computation.

Table 6 shows the comparison with a single-threaded version of CrimeStat, where the self-join index speeds up the query processing time by a factor of 4 for the computation of Ripley's K function.

[Table 6. User response time comparison with CrimeStat. Columns: Dataset Size, User response time for CrimeStat (seconds), User response time for self-join index (seconds).]

Question 2: What is the user response time of the hotspot identification query?

We implemented the W-Query processing algorithm for hotspot identification, Hotspot_JI, on the SJALI. The user response time of the hotspot identification process was compared with the tree matching self-join algorithm using the R-Tree. Figure 7 shows the comparison of the self-join index-based method with the R-Tree-based method. The total response time also includes the time taken for performing I/O. It was observed that the self-join index-based hotspot identification method takes more response time because of the seed selection process, which incurs more updates on the average edge weight of the spatial instances. However, the self-join index outperforms the R-Tree-based on-the-fly join computation, which has processing overheads for removing false positives from identified hotspots.

Figure 7. User response time comparison for hotspot identification.

Table 7 compares the user response time of the self-join index-based algorithms with that of a single-threaded CrimeStat. As can be seen, the self-join index improves the user response time by a factor of 5 for the identification of hotspots.

[Table 7. User response time comparison with CrimeStat. Columns: Dataset Size, User response time for CrimeStat (seconds), User response time for self-join index (seconds).]

6. Conclusions and Future Work

We characterized the computational structure of a class of spatial statistical queries called W-Queries. We defined a set of operations that can be used to process these queries. These operations have been identified as a basic set that is required to process two simple W-Queries, namely Ripley's K and hotspot identification. Table 1 lists other types of W-Queries that are frequently observed in spatial analysis and identifies the two simple W-Queries as the most representative queries. This paper does not claim that the set of operations is complete. We defined the spatial index type selection problem for selecting a suitable spatial index type for handling these operations efficiently. We proposed two variants of the self-join index and presented our design decisions. We proposed algorithms for two simple W-Queries. We presented an algebraic cost model for the proposed set of operations. We performed an experimental evaluation on real crime datasets to demonstrate that the self-join index achieves better user response time than both an R-Tree-based on-the-fly self-join computation and CrimeStat, which computes the W-Matrix repeatedly. These observations establish the utility of the join index for processing W-Queries efficiently, and we have identified a suitable representation of the join index to achieve this objective. This result supports our claim that the self-join index should be supported by SDBMS for processing such queries.

In future work, we plan to evaluate the detailed I/O costs of the W-Query processing algorithms for the proposed variants of the self-join index. We also plan to address critical issues such as concurrency control and recovery, optimal query processing strategies, and extraction of optimal page access sequences for the proposed self-join index variants. We also want to consider more spatial statistical queries such as the Local Moran Index, Moran's I, and Geary's C, as well as other hotspot algorithms.

Acknowledgments

The authors would like to thank the members of the spatial database research group at the University of Minnesota for helpful discussions and comments. We would like to thank Kim Koffolt for her comments to improve the readability of the paper. This work was supported by grants from NSF (CN-S-7864, IIS- 7324), USDOD, and NIJ, and by an unrestricted gift from Ned Levine and Associates.
