Summarization and Matching of Density-Based Clusters in Streaming Environments


Di Yang
Oracle Corporation, 1 Oracle Drive, Nashua, NH, USA
di.yang@oracle.com

Elke A. Rundensteiner
Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA, USA
rundenst@cs.wpi.edu

Matthew O. Ward
Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA, USA
matt@cs.wpi.edu

This work is supported by the NSF, under grants CCF, IIS and IIS. This work was done while the first author was working at WPI.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 2. Copyright 2011 VLDB Endowment.

ABSTRACT

Density-based cluster mining is known to serve a broad range of applications ranging from stock trade analysis to moving object monitoring. Although methods for efficient extraction of density-based clusters have been studied in the literature, the problem of summarizing and matching such clusters with arbitrary shapes and complex cluster structures remains unsolved. Therefore, the goal of our work is to extend the state of the art of density-based cluster mining in streams from cluster extraction only to now also support analysis and management of the extracted clusters. Our work solves three major technical challenges. First, we propose a novel multi-resolution cluster summarization method, called Skeletal Grid Summarization (SGS), which captures the key features of density-based clusters, covering both their external shape and internal cluster structures. Second, in order to summarize the extracted clusters in real time, we present an integrated computation strategy, C-SGS, which piggybacks the generation of cluster summarizations within the online clustering process. Lastly, we design a mechanism to efficiently execute cluster matching queries, which identify clusters similar to a given cluster of interest to the analyst among the clusters extracted earlier in the stream history. Our experimental study using real streaming data shows the clear superiority of our proposed methods over other potential alternatives in both efficiency and effectiveness for cluster summarization and cluster matching queries.

1. INTRODUCTION

Motivation. Mining complex patterns such as clusters and graphs from huge volumes of streaming data has been recognized as critical for numerous application domains. To facilitate such a complex pattern mining process, a streaming pattern mining system not only needs to be equipped with highly efficient pattern extraction algorithms, but more importantly, it must also provide effective pattern analysis support, as motivated below:

1) Pattern feature abstraction. The key features of detected patterns may be complex and thus may not be easily comprehensible for human analysts without analytical assistance. For example, in real-time traffic monitoring, a cluster representing a congestion area in the traffic of Beijing may be composed of 10K or even more vehicles and may spread over more than 10 km². By simply looking at the information about individual cluster members (vehicles), such as their positions and moving speed, an analyst may not be able to identify the key features of this cluster in real time, such as where the key bottleneck causing the congestion is.

2) Pattern compression. Some patterns need to be kept for long-term analysis, yet keeping the full representation of the complex patterns tends to be impractical in streaming environments.
In the previous example, storing the full representation of the detected traffic congestion patterns (arbitrarily shaped clusters), namely the individual cluster member tuples (tens of thousands of tuples for each cluster), would cause not only a huge burden on the storage space but also low efficiency for pattern transmission.

3) Pattern retrieval (matching). For stream analysis, the archived patterns may need to be retrieved based on their features. Using the above example, when a new traffic congestion arises, the analysts may ask whether similar congestion patterns have been detected before. If yes, rather than figuring out a new congestion-relief plan from scratch, the previous proven-to-work solution for such congestion patterns could be directly applied.

In short, an effective pattern summarization method is the key for complex pattern analysis and management. It is needed for many different aspects of pattern analysis, including feature abstraction, compression and pattern retrieval (as mentioned above). The pattern summarizations can also be used for approximated pattern representation. For example, one can design pattern visualization or full representation re-generation techniques based on pattern summarizations. In this work, our goal is to design effective summarization and matching techniques for density-based clusters in streaming environments, which remain open problems for the database community.

Sliding Window Semantics. In this work, we focus on density-based cluster mining in sliding stream windows [7, 8, 16, 17]. In this query semantics, arbitrarily shaped

clusters are continuously detected within the most recent portion of the stream. The traffic congestion monitoring task discussed above is an example that requires such query semantics. Other applications that require such query semantics include detecting intensive-transaction areas (clusters) in the most recent stock trades, and identifying malicious attacks (clusters) in current network traffic.

Challenges. Summarization and matching of density-based clusters is not only an unsolved but also a challenging problem. To serve real-time streaming applications, the proposed techniques must address the following challenges:

1) Cluster summarization must be sufficiently descriptive yet highly compact. The cluster structure of a density-based cluster is defined by a series of densely populated sub-regions as well as the connections among them (see Figure 1). Clearly, simple statistical aggregations, such as the centroid or minimum bounding rectangle of a cluster, are insufficient for describing such a complex pattern structure.

2) The cluster summarization process has to be highly efficient. A system conducting expensive online clustering can hardly afford additional system resources for summarizing clusters in real time.

3) The summarized cluster representation needs to be effectively retrievable ("matchable"). The matching process between cluster summarizations ought to faithfully reflect the similarity between the original clusters, yet be computationally efficient.

Proposed Solution. To address the above challenges, we first analyze density-based cluster structures and identify their key characteristics, namely position, shape, connectivity and density distribution. To capture these features, we investigate two commonly used summarization principles, namely the graph-based and the grid-based strategies. We discover that neither of them alone is capable of providing an effective summarization for density-based clusters. Therefore, we propose a hybrid solution, called Skeletal Grid Summarization (SGS). For descriptive power, SGS is shown to guarantee its fidelity to the original clusters on all key features.
For compactness, our experimental study in Section 8 confirms that even the SGS of the highest resolution achieves on average a 98% compression rate of the full representation of the clusters.

Empowered by the proposed SGS summarization, we design a framework to support both continuous cluster extraction and cluster matching queries. A continuous cluster extraction query in our system not only extracts clusters in their full representation (all cluster member objects) for online monitoring purposes like the other state-of-the-art techniques [3, 16], but it also concurrently compacts them into the SGS summarization. The full and the summarized (SGS) representation formats are complementary to each other, providing a description of the clusters at the individual tuple and cluster feature level respectively.

To extract these two representation formats simultaneously and in a highly efficient manner, we propose an integrated cluster extraction + summarization algorithm, C-SGS. C-SGS incrementally maintains both the full representation and the corresponding SGS of the extracted clusters in an integrated manner. This results in an almost free cluster summarization generation by piggybacking the summarization process onto the cluster extraction process itself. Our experimental study in Section 8 shows that C-SGS, which returns clusters in both full and summarized representation (SGS), has a negligible overhead compared with the state-of-the-art algorithm Extra-N [16], which computes the full representation of clusters only. In all our test cases, the extra response time of C-SGS compared with Extra-N is consistently less than 6% (Section 8.1).

For any to-be-matched cluster specified by the analyst, a cluster matching query identifies similar clusters extracted earlier in the same stream from a pattern archive. To support such queries, our framework first archives the SGS of the extracted clusters into a pattern archive. When executing a cluster matching query, our system deploys a filter-and-refine strategy.
First, the filter phase exploits a feature index to locate the potential matching candidates from the pattern store. Then, the refine phase conducts a more detailed cluster match against these promising candidates and returns those with similarity above a given threshold. Our experimental study shows that, efficiency-wise, our system takes only 3 seconds on average to answer a cluster matching query against 10K archived clusters (Section 8.2). Quality-wise, our user study, which invites human analysts to visually compare the similarity between matched clusters, shows that human analysts agree with a significantly larger percentage of the matched clusters found using our proposed matching mechanism compared to those found by alternatives (Section 8.3).

Contributions. The main contributions of this work include:

1) We propose the first summarization method specifically designed for density-based clusters, namely the Skeletal Grid Summarization (SGS).

2) We present an integrated cluster mining and summarization algorithm, C-SGS, which efficiently computes the full representation and the SGS of the extracted clusters in one shot.

3) We develop a cluster matching mechanism based on SGS to efficiently process cluster matching queries in real time.

4) Our performance evaluation and user study using real streaming data confirm that our proposed techniques are clearly superior to other alternatives in all aspects, including summarization efficiency, cluster matching efficiency and matching quality.

2. RELATED WORK

The concept of density-based clustering was first proposed in [8]. It has drawn significant research attention [7, 16, 17, 12, 3, 4] because of its capability of identifying clusters with arbitrary shapes and specified density. Previous work mainly studied how to efficiently extract such clusters in static [8, 7, 12] or streaming environments [16, 17, 3, 4]. Also, given the prevalence of real-time monitoring tasks in stream applications, researchers have started to design visual platforms allowing human analysts to interactively explore such patterns in streams [14].
However, the fundamental problem of summarizing this important pattern type has not been studied in the literature yet. Without an effective yet compact summarization method, each density-based cluster has to be expressed by its full representation, namely its cluster member objects. Obviously, such a full representation is neither succinct nor does it explicitly reflect the features of each cluster. This causes serious inconvenience for both storage and analysis of density-based clusters.

Traditional clustering methods [10, 19], such as k-means style clustering, treat clusters as statistical phenomena. Therefore, many key features of the clusters, such as their shapes and densities, are summarized using a rather simplistic description. In particular, first, these works assume clusters are spherically shaped. Therefore, the shape of a cluster is usually described using a simple centroid + radius formula. Second, the previous work does not capture the internal features of the clusters, such as how their density is distributed. For example, the density of a cluster is either treated as uniform or as varying along the radius only. Obviously, such a simple formula cannot describe the complex cluster structure of density-based clusters well. This is because both the shapes and density distributions of density-based clusters can be arbitrary, not to mention the complex sub-region connectivities in each cluster. To the best of our knowledge, no summarization method has been specifically designed for density-based clusters.

For computing cluster summarizations in streaming environments, if the clusters are treated as statistical phenomena, they are considered to be aggregatable over time [1, 5]. For example, [1] used one Cluster Feature Vector (CFV) to represent each micro-cluster detected in the stream. They rely on the additivity property of the CFV to aggregate the cluster features over time and compare the features of the same cluster at different time points by subtracting its CFVs at the corresponding time points. However, the complex cluster structure of density-based clusters is not simply aggregatable over the sliding windows. The continuous expiration of old objects and arrival of new objects at each window may cause complex cluster structural changes, such as merges, splits and connectivity changes within the clusters. Clearly, these changes cannot be simply captured by aggregation results. Thus, these techniques cannot effectively capture the features of density-based clusters within sliding windows.

3. PRELIMINARIES

3.1 Density-Based Clustering in Windows

Density-based cluster detection [8, 7] uses a range threshold θ_r ≥ 0 to define the neighbor relationship between objects. For two objects p_i and p_j, if the distance between them is no larger than θ_r, p_i and p_j are said to be neighbors.
We use the function NumNeigh(p_i, θ_r) to denote the number of neighbors an object p_i has, given the θ_r threshold.

Definition 3.1. Density-Based Cluster: Given θ_r and a count threshold θ_c, an object p_i with NumNeigh(p_i, θ_r) ≥ θ_c is defined as a core point. Otherwise, if p_i is a neighbor of any core object, p_i is an edge point. p_i is a noise point if it is neither a core object nor an edge object. Two core objects p_0 and p_n are connected if they are neighbors of each other, or there exists a sequence of core points p_0, p_1, ..., p_{n-1}, p_n, where for any i with 0 ≤ i ≤ n-1, each pair of core points p_i and p_{i+1} are neighbors of each other. Finally, a density-based cluster is defined as a maximum group of connected core objects and the edge objects attached to them. Any pair of core objects within a cluster are connected.

Figure 1 shows an example of a density-based cluster composed of 11 core objects (black) and 24 edge points (grey).

Figure 1: Definition of Density-Based Clusters

We focus on periodic sliding window semantics as proposed in CQL [2] and widely used in the literature [16, 17]. These proposed semantics can be either time- or count-based. Each query has a window with a fixed window size win and a fixed slide size slide (either a time interval or a tuple count). Clusters are generated for each window W_i only based on those data points that fall into the same window W_i. Each cluster is returned as all its cluster member objects associated with the same cluster identification. We call this typical output format the full representation of each cluster.

3.2 Supported Queries and System Overview

Our system supports two types of analytical queries:

Continuous Clustering Queries. A Continuous Clustering Query returns both the full (Section 3.1) and the summarized representation of the extracted clusters (Figure 2). The design of our proposed cluster summarization format will be introduced in Section 4.
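As an illustration of Definition 3.1, the following is a minimal brute-force sketch in Python for a single window. It is not the extraction algorithm of this paper, only a direct reading of the definition. The function names `classify` and `clusters` are hypothetical, and the sketch assumes that an object counts itself among its own neighbors, which matches the counting used in the proof of Lemma 4.2.

```python
from math import dist

def classify(points, theta_r, theta_c):
    """Label every object 'core', 'edge' or 'noise' (Definition 3.1)."""
    n = len(points)
    # Assumption: an object is within distance 0 of itself, so it is
    # included in its own neighbor list.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= theta_r]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= theta_c}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("edge")
        else:
            labels.append("noise")
    return labels, neighbors

def clusters(points, theta_r, theta_c):
    """Return each cluster as a set of object indices."""
    labels, neighbors = classify(points, theta_r, theta_c)
    seen, result = set(), []
    for start, label in enumerate(labels):
        if label != "core" or start in seen:
            continue
        # Flood-fill over core-to-core neighbor links (connected core objects).
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            component.add(u)
            for v in neighbors[u]:
                if labels[v] == "core" and v not in seen:
                    seen.add(v)
                    stack.append(v)
        # Attach the edge objects that neighbor any core object of the component.
        members = set(component)
        for u in component:
            members.update(v for v in neighbors[u] if labels[v] == "edge")
        result.append(members)
    return result
```

The pairwise distance computation is quadratic in the window size, so this sketch is only suitable for checking small examples by hand.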
    DETECT DensityBasedClusters f+s FROM stream
    USING θ_range = r and θ_cnt = c
    IN Windows WITH win = w and slide = s

Figure 2: Continuous Cluster Extraction Query Returning full (f) and summarized (s) representations of clusters

Cluster Matching Queries. Given a user-specified to-be-matched cluster C_i, a cluster matching query finds clusters similar to C_i that reside in the historical pattern archive. We show a template of such a query in Figure 3.

    GIVEN DensityBasedCluster s C_i
    SELECT DensityBasedCluster s C_j FROM History
    WHERE Distance(C_i, C_j) ≤ sim_threshold

Figure 3: Cluster Matching Query finding Clusters Similar to To-Be-Matched Cluster Based on Cluster Summarization

The to-be-matched cluster can be any cluster specified by an analyst. Typically, it may be a cluster detected in the most recent portion of the stream that represents the newest characteristics of the stream. The matched clusters, if any, will be found in the historical pattern store, which archives the clusters extracted by the Continuous Clustering Query earlier in the stream.

3.3 System Overview

To support these two types of analytical queries, we design a framework composed of four major components (Figure 4). Here we give a brief overview of the functionalities of each

component, while in-depth technical details are discussed later in Sections 5 to 7.

Figure 4: System Overview

The Pattern Extractor executes the Continuous Cluster Extraction Query (Figure 2) against the input stream. It outputs both full and summarized representations of the extracted clusters. Both representations are returned to the analyst for real-time monitoring. Meanwhile, the extracted clusters are also passed to the Pattern Archiver for storage, and to the Pattern Analyzer for cluster matching.

The Pattern Archiver selectively archives the newly detected clusters into the Pattern Base. These archived clusters constitute the Stream History available for subsequent Cluster Matching Queries (Figure 3). The Pattern Archiver controls which extracted clusters should be kept in the Pattern Base and at which resolution they should be archived.

The Pattern Base organizes the archived clusters. To facilitate cluster matching against historical clusters, it employs multiple feature indices to organize the archived clusters. This helps the Cluster Matching Queries to quickly locate the potential matching candidates.

The Pattern Analyzer executes the Cluster Matching Queries (Figure 3). If an analyst is interested in any newly extracted cluster and would like to learn whether similar clusters had been detected before in the Stream History, she can submit her Cluster Matching Query to the Pattern Analyzer to search for matches against the Pattern Base.

4. CLUSTER SUMMARIZATION

4.1 Features of Density-Based Clusters

Based on our analysis, we identify four key features that define each density-based cluster, which can be divided into two categories, namely external and internal features.

External Features:

Location: The location of a cluster indicates its position in the data space. It provides basic information about each cluster, such as where a congestion area (a cluster) arises in the traffic, or in which price range an intensive-transaction area, a cluster based on price, volume and transaction time, is detected in the stock transaction stream.
Shape: Density-based clusters can have arbitrary shapes. The shape is a key feature, because a certain shape of the cluster may convey specific meaning for an application. For example, for the clusters representing intensive-transaction areas in stock transactions, a cluster having a long spread on transaction price but a short range on transaction time conveys that a large number of transactions of a certain stock happened in a short time period while its price fluctuated dramatically within this time period.

Internal Features:

Connectivity: The connectivity of a density-based cluster describes how sub-regions within the cluster are connected. It is important for density-based clusters for both definition and application reasons. First, it defines the internal structure of each cluster. The definition of the density-based cluster (see Section 3.1) relies on the connectivities among sub-regions to define a cluster. Second, the connectivities among sub-regions may be relevant to applications. For example, if two sub-regions within a single cluster representing a group of moving troops are not directly connected, then this may indicate that the units in these two sub-regions cannot directly communicate with each other, because there are no connected Head Nodes (core objects) in these two sub-regions of their wireless network.

Density Distribution: Although the definition of density-based clusters imposes a minimal density requirement on objects in a cluster, the density of each cluster can be rather diverse across its sub-regions. The density distribution within each cluster may be of interest to an analyst in many applications. Using the earlier example, even in a single congestion area, the level of congestion (density of vehicles) may vary among sub-regions. Therefore, the density distribution in each sub-region may be the key for working out a congestion relief plan, as the super dense sub-regions may be the areas that cause the congestion.

4.2 Initial Effort: Graph-Based Summarization Method

Any effective summarized representation for density-based clusters has to capture the above four key features (Section 4.1).
Given that density-based clusters may vary arbitrarily in shape, connectivity and also density distributions, using any aggregative method to represent these features will have rather poor descriptive power. Therefore, we propose to leverage an alternative strategy, namely the divide-and-conquer approach. We divide each cluster into sub-regions, and then we describe not only the features in each sub-region but also the interrelationships among the sub-regions.

Given this divide-and-conquer strategy, we first introduce a possible summarization method based on graph theory. This method uses one representative object to represent each sub-region. We call it Skeletal Point Summation (SkPS):

Definition 4.1. For each cluster C_i, the SkPS summarization of C_i is a graph G(V, E) composed of a minimal set of connected core objects of C_i, called Skeletal Points, as vertices V, whose neighborhoods together cover all the objects in this cluster, and the connections among them as edges E.

The graph composed of all core objects in Figure 1 is an example of an SkPS. SkPS captures most of the cluster features and also has good compactness. However, it suffers from several serious shortcomings. First, SkPS has limited descriptive power for a cluster's density distribution. Second, such an SkPS is not efficiently computable. For each cluster, identifying its SkPS is equivalent to the problem of identifying the connected dominating set in an undirected graph, which has been proven to be NP-complete [9]. Third, SkPS is not a viable solution for matching, because a single cluster may have multiple SkPSs with rather different graph structures. Based on our analysis, these limitations suffered by SkPS are

caused by its overlapping and non-deterministic sub-region division strategy. In conclusion, SkPS does not constitute an ideal summarization for density-based clusters. A more detailed discussion of the SkPS method can be found in our technical report [18].

4.3 Proposed Solution: Skeletal Grid Summarization Method

Basics of Grid-Based Summarization. To overcome the limitations suffered by SkPS, we propose to adapt SkPS by dividing each cluster into non-overlapping sub-regions. In particular, we divide the whole data space into uniformly sized grid cells. For each cluster, its sub-region division is now determined by the grid cells into which its members fall. Therefore, a cluster C_i can be represented by all the grid cells containing at least one of C_i's cluster member objects.

Connectivity Preservation. However, this simplistic grid-based summarization lacks one key capability of the SkPS solution, namely it does not capture the connectivity within clusters. In SkPS, both the inner and inter sub-region connectivity information of each cluster is well preserved. First, each sub-region in SkPS itself is well connected, as all objects in a sub-region are neighbors of the same skeletal point. Second, the inter-connections among different sub-regions are explicitly expressed by the edges in SkPS. In contrast, this simplistic grid-based summarization preserves neither of these two types of connectivity information.

Connectivities In Grid Cells. To solve this problem, we propose to integrate the concept of connectivities into the grid-based solution. As a foundation, we first introduce the concept of status for a grid cell. We divide the grid cells in each cluster's summarization into two categories, namely core cells and edge cells.

Definition 4.2. Core cells: a core cell of a cluster C_i contains at least one core object (see Def. 3.1) of C_i. Edge cells: an edge cell of a cluster C_i contains no core object, but at least one edge object (see Def. 3.1) of C_i. Noise cells: a noise cell contains neither core nor edge objects of any cluster.
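The cell statuses of Definition 4.2 can be sketched as follows. This is only an illustrative Python sketch under stated assumptions: the helper names `cell_of` and `cell_statuses` are hypothetical, the grid is assumed to be axis-aligned with its origin at zero, and the cell side length `side` is passed in as a parameter. The input is the member list of a single cluster, with each object already labeled as a core or edge object.

```python
from math import floor

def cell_of(point, side):
    """Index of the uniformly sized grid cell containing `point`."""
    return tuple(floor(x / side) for x in point)

def cell_statuses(members, side):
    """Map each occupied grid cell to 'core' or 'edge' (Definition 4.2).

    `members` is an iterable of (point, label) pairs for one cluster,
    where label is 'core' or 'edge'.
    """
    statuses = {}
    for point, label in members:
        cell = cell_of(point, side)
        if label == "core":
            # One core object is enough to make the cell a core cell.
            statuses[cell] = "core"
        else:
            # An edge object only makes the cell an edge cell if no core
            # object has been (or will later be) recorded for it.
            statuses.setdefault(cell, "edge")
    return statuses
```

Cells that never appear in the returned dictionary contain no member of this cluster; with respect to this cluster they play the role of noise cells.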
(Footnote 1: noise cells are only used in the cluster computation stage.)

For inner-sub-region connections, we follow the basic principle for the sub-region division strategy, which is to pursue homogeneity in each sub-region. In particular, we pick a fine grid size to guarantee that the objects that fall into the same grid cell are neighbors of each other. More precisely, the diagonal of each grid cell is set to be equal to the range threshold θ_r in the given clustering query (see Section 3.1). This grid cell size selection will be relaxed later in our discussion of the multi-resolution cluster summarization (Section 6). Under this fine grid size selection, the core and edge cells can be shown to have the following properties.

Lemma 4.1. All objects in a core cell belong to the same cluster.

Proof: Since each core cell contains at least one core object and all the objects in each core cell are now neighbors of each other, all objects in the same core cell are neighbors of at least one common core object. Based on the definition of a density-based cluster (see Def. 3.1), the neighbors of a core object belong to the same cluster.

Lemma 4.2. The number of objects in an edge cell must be less than the count threshold θ_c in the clustering query.

Proof: We prove this lemma by contradiction. Given that all objects in a grid cell are neighbors of each other, if there were at least θ_c objects in an edge cell, those objects would be core objects, as they would all have at least θ_c neighbors. This contradicts the definition of an edge cell (Def. 4.2).

Given these properties, each grid cell is well connected and constitutes a basic unit for the inter-grid connection expression, as defined below. For the inter-sub-region connection, we now define the connections between grid cells.

Definition 4.3. Two core cells ccl_1 and ccl_2 are directly connected if there exists at least one core object p_i in ccl_1 and one core object p_j in ccl_2 that are neighbors of each other.
Two core cells ccl_0 and ccl_n are connected if they are directly connected to each other, or there exists a sequence of core cells ccl_0, ccl_1, ..., ccl_{n-1}, ccl_n, where for any i with 0 ≤ i ≤ n-1, each pair of core cells ccl_i and ccl_{i+1} are directly connected with each other. An edge cell ecl_i is attached to a core cell ccl_j if there exists at least one object p_i in ecl_i and one core object p_j in ccl_j that are neighbors of each other. Two edge cells are neither connected nor attached.

Given the connection definition for grid cells above, all core cells of a cluster C_i are connected to each other, and all edge cells are attached to at least one core cell of C_i.

Skeletal Grid Summarization. Based on the status and connections of grid cells, we now give the definition of our proposed Skeletal Grid Summarization (SGS) method.

Definition 4.4. A Skeletal Grid Summarization (SGS) of a density-based cluster C_i is composed of all grid cells that contain at least one cluster member object of C_i. We call each grid cell in an SGS a Skeletal Grid Cell (Sc) of C_i. SGS = {Sc_0, Sc_1, ..., Sc_n}. Each Sc_i has five attributes, namely Sc_i = (location[], sidelength, population, status, connection[]).

1) location vector: a sequence of values, each indicating the minimum value on one of the dimensions covered by Sc_i.
2) side length: the range of values on each dimension.
3) population: the number of objects contained by Sc_i.
4) status: whether Sc_i is a core or edge cell.
5) connection vector: a sequence of boolean connection indicators, each indicating Sc_i's connection to one of its adjacent skeletal grid cells. For any edge or noise cell, all connection indicators are false. For any core cell, a connection indicator is true if the corresponding adjacent skeletal grid cell Sc_j is a core cell and Sc_i and Sc_j are directly connected, or if Sc_j is an edge cell attached to Sc_i.

Figure 5 shows an example of our proposed Skeletal Grid Summarization (SGS) for a 2D cluster. SGS achieves our goal of preserving all four features, as shown below.

Lemma 4.3.
Fidelity to Location and Shape: The data space covered by C_i.SGS is larger than that covered by the cluster member objects of C_i by a bounded error. Namely, any point in the data space covered by C_i.SGS is at most θ_r away from a cluster member object in C_i.

Proof: The data space covered by C_i.SGS is composed of the union of the space covered by all its skeletal grid cells. Since all member objects of C_i fall into these grid cells, the data space covered by C_i.SGS is larger than that covered by C_i's member objects. Since each skeletal grid cell in C_i.SGS contains at least one member of C_i, and the diagonal of each cell is θ_r, any point in the data space covered by a skeletal grid cell is at most θ_r away from a member of C_i.

Figure 5: Example of full representation, basic SGS and compressed SGS of a 2D cluster

Lemma 4.4. Fidelity to Density Distribution: For any sub-region in a cluster C_i, which is composed of n (n ≥ 1) grid cells, C_i.SGS can accurately express its density.

Proof: Since the skeletal grid cells in C_i.SGS do not overlap, the population recorded by each skeletal grid cell accurately reflects the number of objects in it. Therefore, for any sub-region covered by the n skeletal grid cells belonging to C_i, we can accurately calculate its density by dividing its total population by its total volume.

Lemma 4.5. Fidelity to Connectivity: If there are two sub-regions in C_i connected through a connected core object path composed of n core objects, there must exist a core cell path connecting these two sub-regions with at most n core cells on this path.

Proof: Since any skeletal grid cell containing a core object is a core cell, if there exists a core object path between two sub-regions, there must exist a core cell path between them. In the worst case, each core cell on this core cell path contains only one core object. Thus the length of the core cell path is at most equal to the length of the core object path.

In conclusion, SGS effectively captures all key features of density-based clusters using a compact description.

5. PATTERN EXTRACTOR

Next, we introduce the pattern extractor that executes the Continuous Clustering Query (Section 3.2), outputting clusters in both full and summarized (SGS) representations. To provide such functionalities, a straightforward approach would be a two-stage process, namely cluster extraction followed by summarization. However, this strategy causes a significant performance overhead compared to cluster extraction only. An in-depth analysis of such a two-phase strategy can be found in our technical report [18].

5.1 Proposed Solution: Integrated Process

To solve this problem, we instead propose an integrated strategy that incorporates cluster extraction and summarization into a single process. The key observation that motivates this integrated computation method is given below.

Observation 5.1. The main tasks for both density-based cluster extraction and SGS computation are the same, namely to first identify the connections (neighborships) among the objects and analyze them to form the cluster structures (in either the full or a summarized representation).

This observation reveals the key commonality among the cluster extraction and summarization processes. Based on it, we design an integrated extraction + summarization method to effectively share the neighborship identification and cluster formation processes.

5.2 Incremental Computation and Challenges

To avoid conducting the prohibitively expensive clustering process from scratch at each window, our proposed method incrementally maintains the cluster structures across the windows. To realize incremental computation, we need to find appropriate meta-data that can be maintained for both the full and summarized cluster representations.

Our proposed solution is that, besides the raw data falling into each window, which needs to be maintained for cluster extraction in any case, we incrementally maintain the skeletal grid cells in the data space. With updated skeletal grid cells, we can easily output both the summarized and full representations of detected clusters. First, based on connections among the skeletal grid cells, we can easily determine the summarized representation SGS (a group of connected skeletal grid cells) for each cluster.
Second, given the SGS of a cluster C, C.SGS, we can determine the member objects of C based on the objects falling into the respective skeletal grid cells belonging to C.SGS.

However, incrementally maintaining skeletal grid cells in an efficient manner is a challenging task. In particular, tracking the changes to the skeletal grid cells caused by expired objects can be extremely expensive in terms of system resource utilization, and thus constitutes the key performance bottleneck for skeletal grid cell maintenance. When an object p_exp expires, we need the connections at the object level to update the connections among the skeletal grid cells. For example, when p_exp expires, we first need to know which objects are neighbors of p_exp, as their neighborships with p_exp end from now on. This may break the connections between the skeletal grid cell Sc_i in which p_exp resides and those in which p_exp's neighbors reside. However, considering the large number of pair-wise neighborships that may exist among the objects, maintaining all of them has been shown, analytically and experimentally, to be extremely expensive in terms of system resource utilization [16]. Therefore, the straightforward incremental maintenance method, which updates skeletal grid cells upon each insertion and deletion, is not practical.

5.3 Lifespan Analysis

To solve this computation bottleneck, we present a skeletal grid cell maintenance method using lifespan analysis. This method eliminates the need for handling the impact of expired objects on the skeletal grid cells. The solution is based on the observation that, in the sliding window semantics, the lifespan of any object as well as the neighborships among objects are deterministic. Therefore, at the insertion stage, when we handle the impact of new objects on the skeletal grid cells, we take the lifespans of the objects into consideration. In particular, we pre-determine the changes that will happen to the skeletal grid cells when these objects expire later. Then at the expiration stage, no further update is needed to handle the impact of expired objects. Thus we avoid the bottleneck discussed above.

Among the five attributes of a skeletal grid cell, location and side length are fixed over time, while the other three, namely population, status and connections, change over time as objects come and go with each window slide. The population of each skeletal grid cell is easily trackable with a simple object counter. Thus, we focus on the lifespan analysis of the status and the connections.

Basics for Lifespan Analysis. First, we start with analyzing the lifespan of individual objects.

Observation 5.2. Given the slide size Q.slide of a query Q and the starting time of the current window W_n.T_start, the lifespan of an object p_i in W_n with time stamp p_i.T is

$$p_i.\mathit{lifespan} = \left\lceil \frac{p_i.T - W_n.T_{start}}{Q.\mathit{slide}} \right\rceil,$$

indicating that p_i will participate in windows W_n to W_{n + p_i.lifespan - 1}.

The number of windows that an object p_i can survive in is determined by the number of window slides after which p_i's time stamp is still greater than the starting time of the window. Based on the lifespan of individual objects, we analyze the lifespan of the neighborship between two objects.

Observation 5.3. Given two objects p_i and p_j, the neighborship between them, Neighbor(p_i, p_j), will hold for

$$\mathit{Neighbor}(p_i, p_j).\mathit{lifespan} = \mathrm{Min}(p_i.\mathit{lifespan},\ p_j.\mathit{lifespan})$$

windows, namely, it will exist in all windows from W_n to W_{n + Neighbor(p_i, p_j).lifespan - 1}, until either p_i or p_j expires.

Based on these observations, we can further analyze the lifespan of the different stages of an object's career.

Observation 5.4. Given an object p_i and all its neighbor objects p_nb1 to p_nbk, the number of windows in which p_i will be a core object is

$$p_i.\mathit{core\_lifespan} = \mathrm{Min}(p_i.\mathit{lifespan},\ win_{\theta_c nei}),$$

with win_{θ_c nei} the number of windows in which at least θ_c objects among p_nb1 to p_nbk will participate.
The number of windows in which p_i will be an edge object is

$$p_i.\mathit{edge\_lifespan} = \mathrm{Min}\left[\,p_i.\mathit{lifespan} - p_i.\mathit{core\_lifespan},\ \mathrm{Max}_{1 \le j \le k}\,(p_{nb_j}.\mathit{core\_lifespan})\,\right].$$

Basically, an object will be a core object in all the windows in which it has at least θ_c neighbors. It will be an edge object when its core object career has ended (it no longer has enough neighbors) but at least one of its neighbors is still a core object.

Lifespan at Grid Cell Level. To tackle skeletal grid cell maintenance, we now extend the concept of lifespan from the object level to the grid cell level. In particular, we analyze how the lifespans of objects, their neighborships and their careers affect the lifespan of a skeletal grid cell's status and connections. For each skeletal grid cell Sc_i, we maintain one lifespan indicator for Sc_i.status and one for each entry of Sc_i.connections[]. Each lifespan indicates, based on the objects in the current window, for how many future windows the value of this attribute will persist. These indicators are updated as new objects arrive.

Lemma 5.1. Status Lifespan. Given a skeletal grid cell Sc_i and all the objects p_0 to p_n in Sc_i, the number of windows in which Sc_i will be a core cell is

$$Sc_i.\mathit{core\_lifespan} = \mathrm{Max}_{0 \le i \le n}\,(p_i.\mathit{core\_lifespan}).$$

Lemma 5.1 can be deduced from the definition of a core cell (Def. 4.2). Namely, Sc_i is a core cell if it contains at least one core object.

Lemma 5.2. Connection Lifespan. Given two skeletal grid cells Sc_i and Sc_j, all objects in Sc_i, p^{sc_i}_0 to p^{sc_i}_n, and all objects in Sc_j, p^{sc_j}_0 to p^{sc_j}_m, the number of windows in which Sc_i and Sc_j will be connected is defined as

$$\mathit{Connection}(Sc_i, Sc_j).\mathit{lifespan} = \mathrm{Max}\left[\,\mathrm{Min}\left(p^{sc_i}_a.\mathit{core\_lifespan},\ p^{sc_j}_b.\mathit{core\_lifespan},\ \mathit{Neighbor}(p^{sc_i}_a, p^{sc_j}_b).\mathit{lifespan}\right)\right],\quad a \in [0, n],\ b \in [0, m].$$

This indicates that two skeletal grid cells remain connected as long as at least one pair of core objects, one from each skeletal grid cell, are neighbors of each other.

Auxiliary Meta-Data. To ensure that we only run one range query search (rqs) for each new object and never rerun rqs for existing objects, we maintain auxiliary meta information for each object in the window.
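The lifespan rules of Observations 5.2 to 5.4 and Lemmas 5.1 and 5.2 can be illustrated with a minimal Python sketch. The function names and the plain-number inputs are assumptions made for illustration and are not part of C-SGS. The sketch reads Observation 5.2 as a ceiling of the elapsed time over the slide size, reads the edge lifespan as the remaining lifespan after the core career capped by the longest neighbor core lifespan, and assumes θ_c ≥ 1.

```python
import math


def object_lifespan(timestamp, window_start, slide):
    """Observation 5.2: number of windows the object still participates in."""
    return math.ceil((timestamp - window_start) / slide)


def neighborship_lifespan(lifespan_a, lifespan_b):
    """Observation 5.3: a neighborship ends when either object expires."""
    return min(lifespan_a, lifespan_b)


def core_lifespan(own_lifespan, neighbor_lifespans, theta_c):
    """Observation 5.4: windows in which at least theta_c neighbors are alive."""
    if len(neighbor_lifespans) < theta_c:
        return 0
    # All neighbors are alive now, so the theta_c-th longest-lived neighbor
    # decides for how many windows at least theta_c of them remain.
    ranked = sorted(neighbor_lifespans, reverse=True)
    return min(own_lifespan, ranked[theta_c - 1])


def edge_lifespan(own_lifespan, own_core_lifespan, neighbor_core_lifespans):
    """Observation 5.4: non-core windows with at least one core neighbor."""
    longest_core_neighbor = max(neighbor_core_lifespans, default=0)
    return min(own_lifespan - own_core_lifespan, longest_core_neighbor)


def cell_core_lifespan(core_lifespans_in_cell):
    """Lemma 5.1: a cell stays core as long as any of its objects is core."""
    return max(core_lifespans_in_cell, default=0)


def connection_lifespan(neighbor_pairs):
    """Lemma 5.2: each item is (core_lifespan_a, core_lifespan_b,
    neighborship_lifespan) for one neighboring pair across the two cells."""
    return max((min(pair) for pair in neighbor_pairs), default=0)
```

For instance, with θ_c = 3 an object that survives 5 windows and has neighbors surviving 1, 2, 3, 4 and 6 windows keeps at least three neighbors for 3 windows, so `core_lifespan(5, [1, 2, 3, 4, 6], 3)` evaluates to 3.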
In particular, we maintain a non-core-career neighbor list for each object p_i to store all of p_i's neighbors in its non-core career. For example, p_i may currently have 100 neighbors. Based on the lifespan analysis, it will be a core object for 3 windows and then, due to most of its neighbors expiring, it will become an edge object for 2 windows before expiration. In this case, the non-core-career neighbor list of p_i only contains its neighbors in the last 2 windows of its lifespan, say 5 objects.

The non-core-career neighbors of each object are maintained in a dynamic hash table. The hash table of each object p_i is initialized to have n buckets, with n the number of windows that p_i can survive. The hash key of the table is the number of windows that a neighbor object can survive. For example, when a data point p_i finds a non-core-career neighbor p_j, p_j is added to the k-th bucket of the hash table, with k the number of windows p_j can still survive (if k is larger than the number of buckets remaining on p_i, p_j is put in the last bucket). At each window slide, we can simply remove the whole first bucket of each remaining object, as all the neighbors in this bucket must have expired after the window slide.

The number of neighbors in such a non-core-career neighbor list is bounded by the constant θ_c. Namely, an object can never have more than θ_c neighbors in its non-core career, otherwise it would instead be a core object in those windows. This theoretical bound guarantees the lightness of this auxiliary meta-data. Also, it provides all necessary access to the objects' neighbors needed in our cluster extraction process. It thus guarantees that we only run the minimum number of range query searches (one for each new object) during the clustering.

5.4 C-SGS Algorithm

We call our proposed algorithm, based on the maintenance of skeletal grid cells and lifespan analysis, C-SGS.

Initialization. For a continuous clustering query, at the initialization stage, C-SGS builds a grid-based index whose grid cell size is equal to the finest skeletal grid size for this query (see Section 4).
We assign to each grid cell in this index the same attributes as the skeletal grid cells, and initially set their status to noise, their density to 0, and all their connections to false.

Handling Insertions. For each new object p_new inserted into the window, C-SGS first loads it into its corresponding skeletal grid cell based on its position in the data space. Then, we run a range query search for p_new to identify p_new's neighbors. Based on the lifespans of p_new and its neighbors (Observation 5.2), we can determine the lifespan of the neighborships among them (Observation 5.3), as well as the lifespan of the different stages of p_new's career (Observation 5.4). Using this information, we can now update the status and connections of the skeletal grid cells in which p_new falls and in which its neighbors reside.

For the status of skeletal grid cells, the insertion of a new object may only cause two types of changes. Namely, it may promote skeletal grid cells to become core cells or prolong their core cell lifespans.

status promotion: A new object p_new may promote the skeletal grid cell Sc_i that it resides in to become a core cell, if it becomes the first core object in Sc_i. In this case, we set the status of Sc_i to core cell and set its core cell lifespan equal to the core object lifespan of p_new. An example of this case is shown in Case 1 of status promotion in Figure 6. p_new may also cause a status change of a skeletal grid cell by upgrading its non-core-object neighbors, which reside in these affected skeletal grid cells, to core objects. In this case, for each upgraded neighbor p_upg of p_new, we first determine the lifespan of p_upg's career by analyzing p_upg itself and its neighbors. As every p_upg was a non-core object, the non-core-career neighbor list helps us to quickly access all its neighbors without running a range query search again. Thus, we update the status of the skeletal grid cell in which p_upg resides to core cell and set its core cell lifespan equal to the core object lifespan of p_upg. Correspondingly, the non-core-career neighbor list of each p_upg also needs to be updated to exclude those objects that will only be neighbors of p_upg during its core object career. An example of this case is shown in Case 2 of status promotion in Figure 6.
status prolong: A new object p_new may prolong the core cell lifespan of the skeletal grid cell Sc_i in which it resides, if p_new's core object lifespan is longer than that of any existing object in Sc_i. In this case, we set Sc_i's core cell lifespan equal to the core object lifespan of p_new. An example of this case is shown in Case 1 of status prolong in Figure 6. p_new may also prolong the core cell lifespans of skeletal grid cells by extending the core object lifespans of p_new's neighbors. For each neighbor p_cole of p_new whose core object lifespan is extended because of p_new's arrival, we first determine by how much its core object lifespan is extended, by analyzing in how many more windows it will have at least θ_c neighbors now that p_new has joined its neighborhood. Then, we update the core cell lifespan of the skeletal grid cell in which each p_cole resides to the core object lifespan of the corresponding p_cole, if the latter is longer. An example of this case is shown in Case 2 of status prolong in Figure 6.

Figure 6: Examples of updating cell status. θ_c = 4, grey circle = edge point, black circle = core point, number on each object = number of windows the object can survive.

For the connections of skeletal grid cells, the insertion of a new object may likewise only cause two types of changes. Namely, it may build new connections between skeletal grid cells or prolong the lifespan of existing connections. The maintenance of the connections follows the same principles used in the status maintenance logic (details are omitted here for space reasons but can be found in [18]).

Handling Expirations. By using the lifespan analysis technique introduced above, the impact on the skeletal grid cells that could be caused by expiring objects has already been handled when the objects arrived. Therefore, no maintenance effort is needed for handling cluster structure changes when individual objects expire. After the window slides, the only update needed for the attributes of skeletal grid cells is to check whether the new window is beyond their lifespans.
If the new window is beyond a cell's core cell lifespan, its status needs to be set back to edge cell. If the new window is beyond the lifespan of any of its connections, the corresponding connection needs to be set back to false.

Output Stage. At the output stage, the updated skeletal grid cells can be viewed as the vertices V in a graph G, and the connections among them can be viewed as the edges E among the vertices. Therefore, we simply conduct a depth-first search on all the core cells to collect the different groups of connected core cells and the edge cells attached to them. Each connected group of skeletal grid cells constitutes the SGS summarization of a cluster C, C.SGS. Given C.SGS, the full representation of C can be easily determined by collecting all objects covered by core cells in C.SGS and those objects that are covered by the edge cells in C.SGS and connected to at least one core object in C.SGS's core cells.

6. PATTERN ARCHIVER

The pattern archiver handles two major tasks, namely pattern compression and selective pattern archival.

6.1 Multi-Resolution Cluster Summarization

Our proposed cluster summarization SGS supports multiple resolutions. In general, the SGS at different levels of resolution follow the same design as presented in Section 4. An SGS of any resolution is composed of a sequence of skeletal grid cells, and each skeletal grid cell has the same 5 attributes introduced before. For any cluster C_x, the SGS of C_x formed by the Pattern Extractor is based on the finest granularity, namely the smallest skeletal grid cells. Thus it is of the finest resolution. We call such an SGS the Basic SGS of C_x. The SGS in coarser resolutions are built by hierarchically combining the Basic SGS. For a cluster C_x, we say that the Basic SGS of C_x is at Level 0 of the resolution hierarchy, denoted as C_x.SGS^{L_0}. Any SGS in a coarser resolution is at a Level n, denoted as C_x.SGS^{L_n}. Each skeletal grid cell in C_x.SGS^{L_n} (n > 0), C_x.Sc^{L_n}, is formed by combining the skeletal grid cells within a certain (θ)-sized hypercube space in C_x.SGS^{L_{n-1}}.
For example, a 2-dimensional cluster C_x has SGS in two resolutions (Figure 5). They are at Levels 0 and 1. If the compression rate θ = 3, each skeletal grid cell of the SGS at Level 1 is made by combining 3^2 = 9 adjacent skeletal grid cells at Level 0. Both the number of resolutions allowed and the parameter θ are part of the configuration of our system.

The compression process of building C_x.SGS^{L_n} can be finished with a single scan of the skeletal grid cells in C_x.SGS^{L_{n-1}}. Given C_x.SGS^{L_{n-1}}, to build C_x.SGS^{L_n} we first generate a set of skeletal grid cells for C_x.SGS^{L_n} that cover the whole data space occupied by the corresponding cells in C_x.SGS^{L_{n-1}}. Then we set the five attributes for each C_x.Sc^{L_n}. The side length of any C_x.Sc^{L_n} is simply equal to the side length of a skeletal grid cell at Level n-1 times θ. Any C_x.Sc^{L_n} is a core cell if at least one C_x.Sc^{L_{n-1}} covered by it is a core cell. Otherwise, it is an edge cell. The population of any C_x.Sc^{L_n} is equal to the sum of the populations of the C_x.Sc^{L_{n-1}} cells covered by it. The connection vector of a C_x.Sc^{L_n} is decided by the connections between the boundary Level n-1 cells covered by it and the Level n-1 cells covered by its adjacent cells.

Budget- and Accuracy-Aware Resolution Selection. Given the multiple resolution choices, the Pattern Archiver can decide in which resolution to archive the patterns based on both the system-resource budget and the accuracy required by the specific analytical tasks. In our SGS design, for a cluster summarization at a certain resolution, both its space consumption and conciseness are deterministic and easily calculable. For space consumption, given the basic SGS of an extracted cluster, we can easily determine the number of skeletal grid cells needed in any other resolution for the same cluster, by calculating how many grid cells at that resolution are needed to cover the same data space. Since the SGS at different resolutions have the same design, the amount of information carried by each skeletal grid cell in any resolution is fixed. Thus, one can easily determine exactly how much storage space is needed for a given cluster in any resolution.
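The single-scan compression step, and the cell count that drives the space estimate, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the cell representation (a dictionary from integer grid coordinates to a status flag and a population) and the function name `coarsen` are hypothetical, the grid is assumed to be aligned at the origin so that a parent cell is found by integer division by θ, and the connection vectors are left out for brevity.

```python
from collections import defaultdict


def coarsen(cells, theta):
    """Build the Level n cells from the Level n-1 cells in one scan.

    cells maps a coordinate tuple to (is_core, population) at Level n-1.
    """
    merged = defaultdict(lambda: [False, 0])
    for coord, (is_core, population) in cells.items():
        # Each coarse cell covers a theta-sized hypercube of finer cells.
        parent = tuple(c // theta for c in coord)
        # Core if at least one covered cell is core, otherwise edge.
        merged[parent][0] = merged[parent][0] or is_core
        # Population is the sum over the covered cells.
        merged[parent][1] += population
    return {coord: (core, pop) for coord, (core, pop) in merged.items()}


level0 = {
    (0, 0): (True, 5),
    (1, 2): (False, 1),
    (3, 0): (False, 2),
    (4, 4): (True, 7),
}
level1 = coarsen(level0, theta=3)
cells_needed = len(level1)  # number of cells to store at Level 1
```

Since every cell carries a fixed amount of information, multiplying `cells_needed` by the per-cell size gives the storage needed at that resolution.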
For accuracy, as the sizes of the skeletal grid cells at all resolutions are known, the analysts know exactly the granularity that their analytical task will be working on for a certain resolution.

6.2 Selective Pattern Archiving

The Pattern Archiver also selectively picks which clusters to archive. Currently, our system supports several simple but useful cluster selection mechanisms, including using sampling techniques to select a certain number of clusters to archive in a period of time and using feature selection to only archive clusters with certain features (e.g., only archive the clusters reaching a certain population or volume). More sophisticated pattern selection techniques, such as evolution-driven techniques, will be studied in our future work.

7. PATTERN STORAGE AND MATCHING

7.1 Pattern Organization in Pattern Base

Our proposed cluster summarization method SGS empowers us to easily organize the extracted clusters based on their features. In particular, we build two indices for the archived clusters. One is based on the position of each cluster, and the second is based on all other features of each cluster captured in SGS. We call the first index the locational feature index. Since clusters are multi-dimensional objects, we express the position of each cluster using its minimum bounding rectangle (MBR). In our system, we employ one of the most widely used indices for MBRs, namely the R-tree index, to organize them. The second index, called the non-locational feature index, organizes the clusters based on their non-locational features. We use a four-dimensional grid index to organize the clusters' SGS, with the four dimensions being the volume (number of skeletal grid cells), the status count (number of core cells), the average density and the average connectivity of each cluster.

7.2 Cluster Matching Process

The Cluster Matching Queries (see Figure 3) are executed by the Pattern Analyzer. To execute such queries, we first provide a distance metric (between 0 and 1) to measure the distance between two clusters. The metric is user-customizable based on application semantics.
$$\mathit{Dist}(C_a, C_b) = ps \cdot \mathit{Dist}_{location}(C_a, C_b) + \sum_i w_i \cdot \mathit{Dist}_{nlf_i}(C_a, C_b)$$

with $ps, \mathit{Dist}_{location} \in \{0, 1\}$, $w_i, \mathit{Dist}_{nlf_i} \in [0, 1]$ and $\sum_i w_i = 1$.

In this distance metric, Dist_location indicates whether two clusters overlap (0) or not (1). ps indicates whether the matching is position-sensitive (1) or not (0). Dist_nlf_i represents the distance between two clusters on a specific non-locational feature and w_i represents the analyst-specified weight on this feature. To use this distance metric, the analyst first needs to specify whether the matching required by her application is position-sensitive, namely whether the matched clusters have to overlap in the data space. For position-sensitive applications, ps = 1. If two clusters do not overlap, Dist_location(C_a, C_b) = 1, the largest possible distance between two clusters, indicating that the two clusters are not similar and that no further comparison on other features is needed. For non-position-sensitive applications, since ps = 0, the locational distance between two clusters is considered to be 0. The second part of the distance metric measures the distance between two clusters on the four non-locational features, namely volume, status, population and connectivity. The distances on these features are used in both the match candidate search and the detailed cell-level cluster match.

Candidate Search. Given a to-be-matched cluster, a customized distance metric and a distance threshold specified by the analyst, our system first searches for potential match candidates in the Pattern Base. In the position-sensitive case, the Pattern Analyzer first searches the locational feature index for candidate clusters. If any overlapping clusters are found, it calculates their non-locational distance to the to-be-matched cluster, and returns the similar clusters if the distances are smaller than the threshold. In the non-position-sensitive case, the Pattern Analyzer directly searches the non-locational feature index for the candidates.
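The customizable distance metric above can be illustrated with a short Python sketch. The function name and argument layout are assumptions made for illustration. The per-feature distances are taken as already normalized to [0, 1], and non-overlapping clusters in the position-sensitive case are given the largest distance, 1, without looking at the other features, as the text describes.

```python
import math


def cluster_distance(overlap, position_sensitive, feature_distances, weights):
    """Weighted distance between two clusters.

    overlap            -- True if the two clusters overlap in the data space
    position_sensitive -- True if the analyst requires overlapping matches
    feature_distances  -- one distance in [0, 1] per non-locational feature
    weights            -- analyst-specified weights, expected to sum to 1
    """
    if len(feature_distances) != len(weights):
        raise ValueError("one weight per feature is required")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")

    ps = 1 if position_sensitive else 0
    dist_location = 0 if overlap else 1
    if ps * dist_location == 1:
        # Largest possible distance; no further comparison is needed.
        return 1.0
    return sum(w * d for w, d in zip(weights, feature_distances))
```

With weights `[0.4, 0.2, 0.2, 0.2]` and feature distances `[0.5, 0.0, 0.25, 0.25]`, two overlapping clusters are at distance 0.4·0.5 + 0.2·0.25 + 0.2·0.25 = 0.3, while the same pair without overlap is at distance 1 in the position-sensitive case.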
Given the distance metric and the distance threshold, the Pattern Analyzer can determine the range of the search on each dimension (feature). For example, if the volume of the to-be-matched cluster is 20, the weight on the volume distance is 0.4, and the overall distance threshold is 0.2, then the volume of the candidate clusters has to be between 14 and 30. This is because any other value x < 14 or x > 30 makes |x - 20| / min(x, 20) > 0.2 / 0.4 = 0.5 (for instance, x = 13 gives 7/13 ≈ 0.54, whereas x = 14 gives 6/14 ≈ 0.43), which definitely does not fulfill the search criteria. The same principle can be used on the other features to determine the range of search. Given the search ranges on all dimensions, the Pattern Analyzer can quickly narrow down the candidate clusters to a small subset by searching the feature index.

Grid Cell Level Cluster Match. Given a to-be-matched cluster and a match candidate cluster for it, the grid cell level cluster match compares the features of the two clusters in their corresponding sub-regions (skeletal grid cells). In particular, the grid cell level match uses the same customizable distance metric introduced earlier, while the distance between two clusters is now measured by aggregating the differences between all the corresponding skeletal grid cell pairs in these two clusters. More precisely, given a certain alignment between two clusters C_a and C_b (see footnote 2), each skeletal grid cell Sc_i in C_a may have a corresponding skeletal grid cell Sc_j in C_b, depending on whether its corresponding sub-region is also covered by C_b. If Sc_i has a corresponding skeletal grid cell Sc_j in C_b, their difference can be measured by comparing their status, density and connectivity features. Otherwise, Sc_i is assigned the maximum difference with its corresponding sub-region, which is not a part of C_b and thus can be viewed as an empty grid cell. When calculating the distance between two clusters C_a and C_b, we sum the differences between each Sc_i in C_a and its corresponding sub-region in C_b to form the overall distance between the two clusters.

In the position-sensitive case, no alignment is needed, or in other words, the alignment vector is always equal to [0,0,...,0]. This is because such applications require any skeletal grid cell Sc_i in C_a to be matched with the skeletal grid cell Sc_j in C_b that has the same absolute position in the data space. Therefore, in such cases, we only need a single scan of the skeletal grid cells in the two clusters to calculate the distance between them. In the non-position-sensitive case, one or more alignments that minimize the distance between two clusters may exist.
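The cell-level comparison under one given alignment can be sketched as follows. This is a minimal Python illustration, not the system's implementation: the cell representation (grid location mapped to a status flag and a population), the `cell_difference` function, and the maximum difference of 1.0 for an uncovered sub-region are all assumptions, and the connectivity feature is omitted to keep the example short.

```python
def cell_difference(cell_a, cell_b):
    """Hypothetical per-cell difference in [0, 1] from status and population."""
    status_diff = 0.0 if cell_a[0] == cell_b[0] else 1.0
    pop_a, pop_b = cell_a[1], cell_b[1]
    # A skeletal grid cell holds at least one object, so max(...) >= 1.
    density_diff = abs(pop_a - pop_b) / max(pop_a, pop_b)
    return (status_diff + density_diff) / 2


def cell_level_distance(cells_a, cells_b, alignment, max_difference=1.0):
    """Sum the differences between each cell of C_a and its aligned sub-region.

    cells_a, cells_b map a location tuple to (is_core, population);
    alignment is the location shifting vector applied to cells of C_a.
    """
    total = 0.0
    for location, cell in cells_a.items():
        shifted = tuple(x + s for x, s in zip(location, alignment))
        other = cells_b.get(shifted)
        # A sub-region not covered by C_b counts as an empty grid cell.
        total += max_difference if other is None else cell_difference(cell, other)
    return total


c_a = {(0, 0): (True, 4), (1, 0): (False, 2)}
c_b = {(1, 2): (True, 4), (2, 2): (True, 1)}
shifted_distance = cell_level_distance(c_a, c_b, (1, 2))
unshifted_distance = cell_level_distance(c_a, c_b, (0, 0))
```

The position-sensitive case corresponds to calling `cell_level_distance` once with the all-zero alignment, while the non-position-sensitive case evaluates several candidate alignments and keeps the smallest distance.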
When given sufficient computation time, such as in an offline computation, one could apply an exhaustive search to find such an optimal alignment. In our system, for online computation, we use an A*-style anytime search algorithm to search for the best alignment within a certain computation time budget. In particular, we start with an alignment that makes the two clusters overlap well. Then we continuously search along the direction of the most promising nearby alignment, which gives the smallest distance so far. When the given computation time budget is reached, we stop searching and return the smallest distance found so far as the distance between the two clusters.

Footnote 2: An alignment for two Skeletal Grid Summarizations (SGS) is a location shifting vector. For example, given two three-dimensional clusters C_a and C_b, an alignment equal to [1,2,1] indicates that any skeletal grid cell in C_a with location vector equal to [x,y,z] corresponds to a skeletal grid cell in C_b with location vector equal to [x+1,y+2,z+1], if any.

8. EXPERIMENTAL EVALUATION

We conducted our experiments on a Dell desktop with an Intel Core2 2.2GHz processor and 3GB memory, running Windows 7 Professional. We implemented the algorithms in VC++.

Real Datasets. We used two real streaming datasets in our experiments. The first dataset, GMTI (Ground Moving Target Indicator) [6], records the real-time information on moving objects gathered by 24 different ground stations or aircraft in 6 hours from JointSTARS. It has around 100,000 records regarding the information on vehicles and helicopters (speed ranging from mph) moving in a certain geographic region. The second real dataset we use is the Stock Trading Traces data (STT) from [11], which has one million transaction records throughout the trading hours of a day. For the experiments that involve data sets larger than these two datasets, we append multiple rounds of the original data, varied by setting random differences on all attributes, until the desired size is reached.

Alternative Summarization Formats.
We compare our proposed Skeletal Grid Summarization (SGS) with three alternative cluster summarization formats. 1) The traditional Centroid-Radius-Density summarization (CRD). 2) Random Sampling Summarization (RSP). The RSP for each cluster is generated by sampling the cluster members at a certain sampling rate R. To compare RSP with our proposed SGS summarization, for each specific cluster in the experiment, R is always controlled so that its RSP has the same memory consumption as the SGS for the same cluster. 3) Skeletal Point Set (SkPS) summarization, our initial cluster summarization design proposed earlier in this paper.

8.1 Efficiency of Cluster Extraction + Summarization

In this experiment, we evaluate how many system resources are needed to generate each of the alternative cluster summarizations. Since our proposed solution, C-SGS, incorporates cluster extraction and summarization into a single process, we compare its performance with the following alternatives. 1) Extra-N: Extract clusters using the state-of-the-art algorithm Extra-N [16] but do not generate any cluster summarization. 2) Extra-N + CRD: Extract clusters using Extra-N and then generate the CRD for each extracted cluster. 3) Extra-N + RSP: Extract clusters using Extra-N and then generate the RSP for each extracted cluster. 4) Extra-N + SkPS: Extract clusters using Extra-N and then generate the (approximated) SkPS for each cluster using the MG algorithm proposed in [9].

We first run each alternative method against the STT stream to extract clusters based on four dimensions, namely the transaction type (buy/sell), price, volume and time. To compare the performance of the alternatives when handling clusters with different characteristics, we use three different query parameter settings, namely case 1: (θ_r = 0.05, θ_c = 10), case 2: (θ_r = 0.1, θ_c = 8), case 3: (θ_r = 0.2, θ_c = 5). Also, for each case, we use three different window parameter settings: we fix the window size (win) for all three settings at 10K tuples, while varying the slide size (slide) to 0.1K, 1K and 5K tuples respectively.
For each case, we first verify the correctness (see footnote 3) of our proposed C-SGS cluster extraction method by comparing the clusters extracted by it in full representation with those extracted by the state-of-the-art technique Extra-N. In all the test cases, we found that the clusters extracted by C-SGS are identical to those extracted by Extra-N.

Footnote 3: All clustering algorithms following the definition in [8] should produce the same clustering results given the same input object sequence.

For efficiency, we measure two major performance metrics for stream processing: 1) The average response time for each window (denoted as Response Time). For each window, we measure the CPU time elapsed from the time that all new data have arrived to the time that all clusters have been output in both the full and summarized representation. The average response times shown in all cases are averaged over 10K windows. 2) The memory footprint, namely the peak memory utilization of each alternative among the 10K windows.

As shown in Figure 7, compared to Extra-N, which extracts clusters only but does not generate any cluster summarization (the baseline case), the other four alternatives, each generating a specific type of cluster summarization, exhibit some overhead in terms of CPU time utilization. However, the overhead caused by C-SGS, Extra-N + CRD, and Extra-N + RSP is very modest, if not negligible. The reason for the modest overhead of Extra-N + CRD and Extra-N + RSP is that CRD and RSP are very simple summarization formats that can easily be generated by at most two scans of the cluster members of each cluster. The overhead caused by our proposed solution C-SGS is comparable with that of those two simple summarization methods. This is because the major computation needed for generating the SGS cluster summarization, namely determining the status and connections of the skeletal grid cells, is piggy-backed on the cluster extraction process itself. The CPU overhead of Extra-N + SkPS is significantly higher than that of the other alternatives. This is because generating SkPS is computationally very expensive [9]. For different window parameter settings, C-SGS has lower overhead for the settings with larger win/slide ratios. This is because the performance of Extra-N is affected by the increasing number of views that need to be maintained, which is equal to win/slide (see [16] for details), while the meta-data maintained by C-SGS and the corresponding maintenance effort are independent of this ratio.

Memory-wise, as shown in Figure 7, our proposed method C-SGS also exhibits very limited overhead in all test cases. This is because the process of generating SGS happens in place with the cluster extraction process. Similar performance is also observed in the same experiments using the GMTI data.
We have also conducted an experiment showing the superiority of our proposed method when using time-based windows and under fluctuating input rates. The details of these experiments can be found in our technical report [18]. In conclusion, using our proposed C-SGS solution, we can efficiently generate the Skeletal Grid Summarization (SGS) for extracted clusters during the online clustering process, with very limited system resource overhead.

Figure 7: CPU time and Memory comparisons for generating alternative summarizations.

8.2 Efficiency of Cluster Matching Queries

Next, we study the performance of running the cluster matching queries using our proposed summarization format SGS and the other alternative summarization formats. We run three queries using the same pattern parameter settings as used in the previous experiment but with the same window parameter setting (Win = 10K, Slide = 1K) against the STT data using our proposed C-SGS method. We vary the size of the Pattern Base over 0.1K, 1K and 10K clusters respectively. In each test case, we run each clustering query and archive all the clusters detected into the Pattern Base until the required number of archived clusters is reached. For each archived cluster, we also generate and keep the other three alternative cluster summarization formats for evaluating the other matching methods. Once the required number of clusters is archived, we stop archiving and randomly pick 100 newly detected clusters as to-be-matched clusters. For each to-be-matched cluster, we run four matching queries for it against the archived clusters, each using one alternative cluster summarization method and the corresponding distance metric. In particular, we implement a subtraction function to measure the distance between the CRDs of two clusters, which gives equal weight to the three cluster features captured in CRD, namely the centroid, radius and density. We use the subset matching algorithm presented in [15] to calculate the distance between the RSPs of two clusters.
We use the graph edit distance algorithm presented in [13] to calculate the distance between the SkPSs of two clusters. We give equal weight to all four features when measuring the distance between the SGSs of two clusters. For each Pattern Base size, we measure the average response time for all cluster matching queries and the memory space consumed by storing cluster summarizations.

As shown in Figure 8, when matching against 0.1K clusters, the average response time for each cluster matching query using SGS is less than 0.1 second. For the 1K and 10K cases, the average response time for our solution is only around 0.5 seconds and 3 seconds, respectively. Such high efficiency is comparable with cluster matching using CRD, which is very fast because of its extremely simple matching mechanism (simply three subtraction operations). This is due to the design of SGS, which effectively summarizes the key features of each cluster on both the cluster and grid levels. In particular, by using our proposed two-phase matching strategy, the majority of the candidates in the pattern base are filtered out in the summarization matching phase. Thus, the more expensive grid-level matching is only needed for a very small portion of the candidates. In our experiment, we found that only 6% of the candidate clusters necessitated the grid-level match on average during the cluster matching process.

Memory-wise, SGS consumes only 0.12M, 1.38M and 12.24M of memory space to store 0.1K, 1K and 10K clusters respectively (Figure 8). In particular, each 4-dimensional skeletal grid cell consumes only 23 bytes: position: 16 bytes (4 integers), status: 1 byte (1 boolean), density: 4 bytes (1 integer), connection: 2 bytes (2^4 = 16 booleans). In all test cases, the average number of skeletal grid cells in each cluster is 68. Therefore, only 1.5K of memory is needed to store the SGS of each cluster on average. Compared with the memory space needed for storing the full representation of the clusters, which needs 6.4M, 75.2M and 680.2M to store 0.1K, 1K and 10K clusters respectively, the average compression rate of SGS in our experiment is around 98%.

In conclusion, our proposed solution demonstrates high efficiency for cluster matching queries, which is significantly better than matching SkPS or RSP. Its performance is comparable with matching the simple CRD cluster summarization. However, matching CRD is shown to have a much worse cluster matching quality compared with our proposed method of matching SGS (see next experiment below).

Figure 8: CPU time and Memory comparison for cluster matching queries using alternative cluster summarization methods

8.3 Quality of Cluster Matching

To measure the quality of cluster matching using alternative summarization formats, we invited 20 human analysts (all WPI graduate students) to visually analyze the similarity between the to-be-matched cluster and the matched clusters found for them using one alternative method. The analysis process is supported by ViStream [14], a freeware multivariate data visualization tool, which has been shown to be effective for helping human analysts to observe and understand multi-dimensional clusters in streams. For each to-be-matched cluster, the analysts are asked to rate the top three similar clusters found by each summarization format into three categories, namely very similar, similar, and not similar.

Figure 9: Similar rate given by users for matched clusters found by alternative summarizations

As shown in Figure 9, our proposed summarization method SGS demonstrates a high similar rate, which is significantly better than all the other alternatives. This indicates that the human analysts agree with most of the similar clusters found using SGS, while disagreeing on a large percentage of those found using other alternatives. This shows the high effectiveness of SGS summarization in terms of cluster matching. Due to the page limit, the detailed experimental setup and result analysis of this experiment are omitted here but can be found in our technical report [18].
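The storage figures reported in Section 8.2 can be re-derived with a few lines of arithmetic. The sketch below only restates numbers given in the text (the per-cell field sizes, the average of 68 cells per cluster, and the reported summary and full-representation sizes); the variable and function names are hypothetical, and packing the 16 connection flags into 2 bytes follows the stated per-cell size.

```python
# Per-cell layout of a 4-dimensional skeletal grid cell, in bytes,
# as listed in Section 8.2.
CELL_FIELDS = {
    "position": 4 * 4,         # 4 integers, 4 bytes each
    "status": 1,               # 1 boolean stored in one byte
    "density": 4,              # 1 integer
    "connection": 2 ** 4 // 8  # 2^4 = 16 booleans packed as bits
}
CELL_BYTES = sum(CELL_FIELDS.values())

AVG_CELLS_PER_CLUSTER = 68  # average reported over all test cases

# (SGS size, full-representation size) in M for 0.1K, 1K and 10K clusters.
REPORTED_SIZES = [(0.12, 6.4), (1.38, 75.2), (12.24, 680.2)]


def sgs_bytes_per_cluster(cells: int = AVG_CELLS_PER_CLUSTER) -> int:
    """Bytes needed to store the skeletal grid cells of one cluster."""
    return cells * CELL_BYTES


def compression_rate(summary_size: float, full_size: float) -> float:
    """Fraction of space saved by storing the summary instead."""
    return 1.0 - summary_size / full_size


if __name__ == "__main__":
    print("bytes per cell:", CELL_BYTES)
    print("bytes per cluster:", sgs_bytes_per_cluster())
    for summary, full in REPORTED_SIZES:
        print(f"{summary}M vs {full}M: {compression_rate(summary, full):.1%}")
```

With these inputs, the per-cluster size comes to roughly 1.5K and each of the three pattern base sizes gives a compression rate of about 98%, consistent with the averages stated in Section 8.2.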
We also conducted a series of experiments to confirm both the efficiency and effectiveness of cluster matching queries when using SGS with different resolutions. The details of those experiments can be found in our technical report [18].

9. CONCLUSION

In this work, we present a framework to support summarization and matching of density-based clusters in streaming environments. First, our work solves several open problems for density-based cluster analysis, namely, designing a descriptive yet compact summarization method for such clusters. Second, we present an efficient computation strategy to quickly summarize the detected clusters into SGS during the online clustering. Lastly, we design a cluster archiving and matching mechanism, which allows the analysts to submit cluster matching queries to find similar clusters detected earlier in the stream history. Our experimental study demonstrates the clear superiority of our proposed methods in both efficiency and effectiveness.

10. REFERENCES

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In VLDB, pages 81-92.
[2] A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: semantic foundations and query execution. VLDB J., 15(2).
[3] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In SDM.
[4] Y. Chen and L. Tu. Density-based clustering for real-time stream data. In KDD.
[5] B.-R. Dai, J.-W. Huang, M.-Y. Yeh, and M.-S. Chen. Adaptive clustering for multiple evolving streams. IEEE Trans. Knowl. Data Eng., 18(9).
[6] J. N. Entzminger, C. A. Fowler, and W. J. Kenneally. Jointstars and gmti: Past, present and future. IEEE Trans on Aero and Elec Sys, 35(2).
[7] M. Ester, H. Kriegel, J. Sander, M. Wimmer, and X. Xu. Inc. clustering for mining in a data warehousing environment. In VLDB.
[8] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD.
[9] S. Guha and S. Khuller. Approx. algo. for connected dominating sets. Algorithmica, 20.
[10] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Applied Statistics, 28(1).
[11] I. INETATS. Stock trade traces.
[12] L. Lelis and J. Sander. Semi-supervised density-based clustering. In ICDM.
[13] M. Neuhaus, K. Riesen, and H. Bunke. Fast suboptimal algorithms for the computation of graph edit distance. In SSSPR.
[14] D. Yang, Z. Guo, Z. Xie, E. A. Rundensteiner, and M. O. Ward. Interactive visual exploration of neighbor-based patterns in data streams. In SIGMOD.
[15] D. Yang, E. A. Rundensteiner, and M. O. Ward. Nugget discovery in visual exploration by query consolidation. In CIKM.
[16] D. Yang, E. A. Rundensteiner, and M. O. Ward. Neighbor-based pattern detection for windows over streaming data. In EDBT.
[17] D. Yang, E. A. Rundensteiner, and M. O. Ward. A shared execution strategy for multiple pattern mining requests over streams. PVLDB, 2(1).
[18] D. Yang, E. A. Rundensteiner, and M. O. Ward. Summarization and matching of complex patterns in streaming environment. WPI-CS-TR-11-04, dyang/wpicstr1104.pdf.
[19] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data clustering method for very large databases. In ACM SIGMOD.
