Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases


Bernhard Braunmüller, Martin Ester, Hans-Peter Kriegel, Jörg Sander
Institute for Computer Science, University of Munich
Oettingenstr. 67, D München, Germany
email: {braunue ester kriegel

Abstract

Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest neighbor queries are the most important queries. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. In this paper, we introduce a generic scheme for such data mining algorithms and we develop a method to transform such algorithms in a way that they can use multiple similarity queries, i.e. sets of queries issued simultaneously. We investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed up the processing of multiple similarity queries. The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parallelization yields an additional impressive speed-up. An extensive performance evaluation confirms the efficiency of our approach and we conclude that multiple similarity queries should be provided as a basic DBMS operation in order to support many data mining applications in metric databases.

1 Introduction

Metric databases are databases where a metric distance function is defined for pairs of database objects. A prominent special case is databases of objects from a vector space, that is, objects with numeric attributes. For example, multimedia objects [Fal+ 94] are typically represented by a large number of numeric features such as shape descriptors or color histograms. In many scientific applications, e.g.
in astronomy [Hog 97], automatic facilities measure a large number of numeric values for each database object such as the amplitude emitted in some frequency band. On the other hand, in a database monitoring WWW accesses, the objects may model URLs and these objects are not from a vector space but a metric distance function can be supplied. Similarity between database objects is expressed by the distance function such that a low distance corresponds to a high degree of similarity whereas two objects with a large distance are considered to be rather dissimilar. Similarity queries [WSB 98], e.g. range queries or k-nearest neighbor queries, are the most important queries in metric databases. Such queries play a major role in applications such as multimedia systems, decision support systems and data mining.

A lot of research on analyzing large databases - manually and automatically - has been conducted. Data exploration is the process of manually exploring a database [KLTW 96]. A user starts at a given database object and from there he or she interactively navigates through the database, for example by iteratively retrieving all similar objects. That is, the answers of previous queries may be used as query objects for new similarity queries. Knowledge discovery in databases (KDD) has been defined as the non-trivial process of discovering valid, novel, potentially useful, and ultimately understandable patterns from data [FPS 96]. The core step of the KDD process

is the step of data mining, i.e. the application of appropriate algorithms that automatically produce a particular enumeration of patterns over the data. For example, a density-based clustering algorithm such as DBSCAN [EKSX 96] starts from some object and repeatedly retrieves the neighborhood of objects which have been retrieved by previous queries as long as the density in this neighborhood is large enough.

In traditional query processing, single queries are issued independently by different users. In manual data exploration as well as in automatic data mining, however, many similarity queries must be answered in a single application. We define multiple queries as sets of queries issued simultaneously. Clearly, multiple queries provide much more potential for query optimization than single queries. In this paper, we investigate two orthogonal approaches to speed up the processing of multiple similarity queries in metric databases: reduce I/O cost (that is, the number of disk accesses) and reduce CPU cost (that is, the number of distance calculations). Furthermore, we explore the potential of parallelization. The proposed techniques can be combined in order to obtain a maximum performance of processing multiple similarity queries.

The rest of this paper is organized as follows. In section 2, we briefly review the standard methods of processing single similarity queries and we introduce some basic notions and algorithms. Section 3 introduces an algorithmic scheme typical for many data mining applications and discusses several instances of this scheme. Furthermore, we develop a method to transform such algorithms so that they issue sets of multiple queries. The concepts of multiple similarity queries are introduced in more detail in section 4. Section 5 presents several techniques for efficiently supporting multiple similarity queries. An extensive performance evaluation of the proposed techniques on two real databases is presented in section 6.
Section 7 summarizes the contributions of this paper and outlines some issues for future research.

2 Processing Single Similarity Queries

Let the database objects be drawn from a set of Objects and let dist be a metric distance function for pairs of objects, i.e. $dist: Objects \times Objects \to \mathbb{R}^+$, and dist satisfies the following conditions for all $O_1, O_2, O_3 \in Objects$:

1) $dist(O_1, O_2) = 0 \Leftrightarrow O_1 = O_2$ (identity)
2) $dist(O_1, O_2) = dist(O_2, O_1)$ (symmetry)
3) $dist(O_1, O_3) \le dist(O_1, O_2) + dist(O_2, O_3)$ (triangle inequality)

Often, the Euclidean distance or a weighted Euclidean distance is used as the distance function but, depending on the application, other distance functions may be more appropriate. For instance, quadratic form distance functions were successfully applied for an image database using color histograms as features [SK 97].

Definition 1: (similarity query) Let $DB \subseteq Objects$ be a database and let $Q \in Objects$ be a query object. Let T denote the type of the similarity query and let $sim_T: Objects \times Objects \to Boolean$ be a predicate defining the similarity of pairs of objects with respect to the type T. A similarity query, denoted as DB.similarity_query(Q, T), returns the following database objects:

$DB.similarity\_query(Q, T) = \{ O \in DB \mid sim_T(O, Q) \}$.

The specification of the query type T consists of three components:
- T.range: a real number specifying a maximum distance between Q and an answer.
- T.cardinality: an integer number defining the maximum cardinality of the set of answers.

- T.kind: a string indicating how to combine the range condition and the cardinality condition.

The well-known types of similarity queries are obtained by different specializations of the query type.

Definition 2: (range query) A range query with respect to a database $DB \subseteq Objects$ and a query object $Q \in Objects$ is a similarity query with T.range = $\varepsilon$, T.cardinality = $+\infty$ and T.kind = range which returns the following subset of database objects:

$DB.similarity\_query(Q, T) = \{ O \in DB \mid dist(O, Q) \le \varepsilon \}$.

Definition 3: (k-nearest neighbor query) A k-nearest neighbor query with respect to a database $DB \subseteq Objects$ and a query object $Q \in Objects$ is a similarity query with T.range = $+\infty$, T.cardinality = k and T.kind = k-nearest neighbor which returns a set $NN_Q(k) \subseteq DB$ that contains k objects from the database and for which the following condition holds:

$\forall P \in NN_Q(k), \forall O \in DB \setminus NN_Q(k): dist(P, Q) \le dist(O, Q)$.

Other types of similarity queries have been proposed in the literature. For instance, we may be interested in the k-nearest neighbors but only in those within a specified range.

In order to speed up similarity query processing, many spatial index structures (for good surveys see [Sa 89], [GG 98]) have been developed which are applicable for the important special case where the database objects are from a vector space. For instance, the R-tree [Gut 84] generalizes the one-dimensional B-tree to d-dimensional data spaces, that is, an R-tree manages d-dimensional hyper-rectangles instead of one-dimensional numeric keys. The R-tree and its variants such as the R*-tree [BKSS 90] are efficient only for relatively small numbers of dimensions d. Recently, index structures have been designed which are also efficient for some larger values of d. For instance, the X-tree [BKK 96] is similar to an R*-tree but introduces the concept of supernodes, i.e. nodes of variable size in the directory of the tree. Directory nodes are merged into one supernode, i.e.
directory nodes are not split, if there is a high probability that all parts of the node have to be searched anyway for most queries. In [Kei 97] and [WSB 98], it is shown that under the assumption of uniformly distributed data, above a certain dimensionality no index structure can process a nearest neighbor query efficiently. Thus, it is suggested to use the sequential scan which obtains at least the benefits of sequential rather than random disk I/O. In the VA-file [WSB 98], clever bit encodings of the data are used to speed up the scan.

The above index structures are only applicable for vector spaces. The more general case of metric databases, however, is also important in applications such as WWW access log databases. Then, the database objects may be sessions grouping all log entries with identical IP address and user id within a given maximum time gap [MJHS 96]. General metric databases require other types of index structures, the so-called metric trees. In these metric trees the triangle inequality is used to prune the search tree while processing a similarity query. Most of these structures are static in the sense that they do not allow dynamic insertions and deletions of objects. A recent paper [CPZ 97] has introduced a dynamic metric index structure, the M-tree, which is a balanced tree that can be managed on secondary memory. The leaf nodes of an M-tree store all the database objects. Directory nodes store so-called routing objects and associated covering radii to guide the search operations.
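The pruning role of the triangle inequality in a metric tree can be illustrated with a covering-radius test for a range query. The following is only a minimal sketch, assuming that every object below a routing entry lies within the covering radius of its routing object; the function name `can_prune_subtree` and the one-dimensional example metric are hypothetical and are not the M-tree's actual interface.

```python
def can_prune_subtree(dist_q_routing: float, covering_radius: float,
                      query_range: float) -> bool:
    """Return True if no object below the routing entry can answer the query.

    For any object O in the subtree, dist(O, R) <= covering_radius, so the
    triangle inequality gives dist(Q, O) >= dist(Q, R) - covering_radius.
    If this lower bound exceeds the query range, the subtree can be skipped.
    """
    return dist_q_routing - covering_radius > query_range


def dist(a: float, b: float) -> float:
    # Hypothetical metric for the demo: absolute difference on the real line.
    return abs(a - b)


subtree = [9.0, 10.0, 11.5]          # objects stored below the routing entry
routing_object, radius = 10.0, 1.5   # radius covers every object in `subtree`
near_query, far_query, eps = 12.0, 20.0, 1.0

prune_far = can_prune_subtree(dist(far_query, routing_object), radius, eps)
prune_near = can_prune_subtree(dist(near_query, routing_object), radius, eps)
```

In this example the subtree is skipped for the far query without computing any distance to its objects, while the near query still has to descend because the lower bound does not exceed the range.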

Query processing using a sequential scan is straightforward: all objects must be visited to answer a query. When using an index it may be possible to exclude large proportions of the data from a search. To answer a range query using a tree-based index, the set of approximations (e.g. hyper-rectangles in case of an R-tree) intersecting the query region is determined recursively starting from the root. In a directory node, the entries intersecting the query region are determined and then their referenced child nodes are searched until the data pages are reached. The procedure is more sophisticated for a k-nearest neighbor query because we do not know the k-nearest neighbor distance in advance. The algorithm proposed by [HS 95] has been proven in [BBKK 97] to minimize the number of pages read from disk. This algorithm processes the data pages in ascending order of distance from the query point and does not load those pages with an approximation further away than the k-nearest neighbor found so far.

To conclude this section, we present an algorithm for processing single similarity queries. This algorithm is applicable for any type of similarity query and it can be implemented either by using an index structure or by performing a sequential scan. Figure 1 presents the algorithm as a method of the class DB (database). It takes two arguments, a query object Q and a query type T, and returns a list of objects answering the similarity query.
    DB::similarity_query(object Q; type T)
        Answers := initialize_answer_list();
        determine_relevant_data_pages(Q, T);
        QueryDist := T.range;
        while Self.unprocessed_pages() do
            NextPage := read_next_page_from_disk();
            for each object O in NextPage do
                Distance := dist(O, Q);
                if Distance ≤ QueryDist then
                    Answers.insert(O); // in ascending order of dist(A, Q)
                    if Answers.cardinality() > T.cardinality then
                        Answers.remove_last_element();
                    QueryDist := adapt_query_dist(Distance, QueryDist, T);
                    Self.prune_pages(QueryDist);
        return Answers;

Figure 1: Algorithm similarity_query

We discuss the most important details of this algorithm. The method DB::determine_relevant_data_pages(Q, T) is based on the algorithm presented in [BBKK 97] and it constructs a sequence of the physical addresses of all data pages which may contain answers of the similarity query specified by Q and T. Note that the resulting sequence is managed as a private attribute of the class DB which is read by the method DB.unprocessed_pages() and which is updated by DB.read_next_page_from_disk(). If the implementation makes use of an index structure, then a subset of all data pages of DB may be recognized as irrelevant and thus is not returned. Otherwise, if a sequential scan is performed, all data pages of DB are relevant. In both cases, the relevant data pages are ordered according to their physical address such that the number of disk seeks is minimized. DB.adapt_query_dist(Distance, QueryDist, T) changes the QueryDist only in the case of a k-nearest neighbor query but not in the case of a range query. DB.prune_pages(QueryDist) removes from the internal DB attribute of relevant data pages every element page satisfying dist(page, Q) > QueryDist. Clearly, this method performs no operation if QueryDist has not been adapted.
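The algorithm of figure 1 can be sketched in Python for the sequential-scan case. This is a minimal illustration under stated assumptions, not the authors' implementation: pages are plain lists of objects, `QueryType` is a hypothetical stand-in for T (without T.kind), and page pruning is omitted because a scan keeps no page approximations.

```python
import bisect
import math
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class QueryType:
    """Hypothetical stand-in for the query type T."""
    range: float = math.inf        # T.range
    cardinality: float = math.inf  # T.cardinality


def similarity_query(pages: Sequence[Sequence[Any]], q: Any, t: QueryType,
                     dist: Callable[[Any, Any], float]) -> List[Any]:
    answers: list = []        # (distance, sequence number, object), ascending
    query_dist = t.range
    seq = 0                   # tie-breaker so that objects are never compared
    for page in pages:        # sequential scan: every page is relevant
        for obj in page:
            d = dist(obj, q)
            if d <= query_dist:
                bisect.insort(answers, (d, seq, obj))
                seq += 1
                if len(answers) > t.cardinality:
                    answers.pop()                 # remove the farthest answer
                if len(answers) == t.cardinality:
                    query_dist = answers[-1][0]   # adapt only for k-NN queries
    return [obj for _, _, obj in answers]


pages = [[1.0, 4.0], [2.5, 9.0], [3.0]]


def abs_dist(a: float, b: float) -> float:
    return abs(a - b)


range_answers = similarity_query(pages, 3.0, QueryType(range=1.0), abs_dist)
knn_answers = similarity_query(pages, 3.0, QueryType(cardinality=2), abs_dist)
```

For a range query the cardinality is infinite, so the query distance never changes; for a k-nearest neighbor query it shrinks to the distance of the current k-th answer, which is the role of adapt_query_dist in figure 1.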

3 Data Mining Using Multiple Similarity Queries

In many data mining applications the database is explored by iteratively considering the neighborhood of some start objects. In this section, we introduce a generic scheme for such data mining algorithms and discuss some typical instances of the scheme. Furthermore, we develop a method to transform such algorithms in a way that they can use multiple similarity queries instead of single similarity queries.

3.1 Iterative Neighborhood Exploration

Many data mining algorithms start from a set of specified database objects and iteratively consider the neighborhood of the visited objects. The neighborhood of a given object is defined as the set of similar database objects with respect to a similarity query. We introduce a generic scheme for such algorithms which we call ExploreNeighborhoods. Figure 2 depicts the algorithmic scheme in pseudo code notation where DB denotes a database, StartObjects denotes a subset of DB and SimType specifies the type of similarity query. The dots in the argument list of some functions indicate additional arguments which may be necessary for different instances of this algorithmic scheme.

    ExploreNeighborhoods(DB, StartObjects, SimType, ...)
        ControlList := StartObjects;
        while ( condition_check(ControlList, ...) = TRUE ) do
            Object := ControlList.choose(...);
            proc_1(Object, ...);
            Answers := DB.similarity_query(Object, SimType);
            proc_2(Answers, ...);
            ControlList := ( ControlList ∪ filter(Answers, ...) ) - {Object};
        end while;

Figure 2: Algorithmic scheme ExploreNeighborhoods

Starting from the objects in the set StartObjects, the algorithm repeatedly retrieves the neighborhood of objects taken from the ControlList as long as the function condition_check returns TRUE for the ControlList. In the most simple form, the function checks whether ControlList is not empty.
If the neighborhood of objects should only be investigated up to a certain depth, then an additional parameter for the number of steps that have to be performed can be used in the function condition_check. The control structure of the main loop works as follows: objects are selected from the ControlList, one at a time, and a similarity query is performed for this object. The procedures proc_1 and proc_2 perform some processing on the selected object as well as on the answers of the similarity query that will vary from task to task. Before repeating the loop, the ControlList is updated. Some or all of the answers which are not yet processed are simply inserted into the ControlList. The function filter(Answers, ...) removes from the set of answers at least those objects which have already been in the ControlList in previous states of the algorithm, if any exist. This must be done to guarantee the termination of the algorithm.

3.2 Instances of Iterative Neighborhood Exploration

In the following, we discuss several typical instances of the ExploreNeighborhoods scheme in order to show that many data mining applications follow this scheme:

- Manual Data Exploration
When manually exploring a multimedia database, for example, in proc_2 the answers are visualized and the user may store the multimedia objects considered to be interesting. In the filter the user may prune answers which are too dissimilar from the initial start objects.

- Spatial Association Rules
[KH 95] introduces spatial association rules describing associations between objects based on spatial neighborhood relations. For instance, a rule may be discovered stating that 80% of the selected towns are close to some water such as a lake or river. In this algorithm, the set StartObjects is equal to the set of all database objects of a specified type such as town. SimType corresponds to the type of spatial neighborhood such as intersects which is given by the user. proc_2 calculates the support, that is the relative frequency, of the retrieved pairs of objects and the filter passes all of these pairs which have at least a specified minimum support.

- Density-Based Clustering
DBSCAN [EKSX 96] is a typical density-based clustering algorithm. To find a cluster, DBSCAN starts with some database object o and retrieves all objects density-reachable from o with respect to two parameters Eps and MinPts. Initially, ControlList contains an arbitrary database object and range queries with a range of Eps are used as similarity queries. proc_2 counts the answers and the filter passes all answers which have not yet been assigned to some cluster if the cardinality of the set of answers is at least MinPts.

- Simultaneous Classification of a Set of Objects
In an astronomy database [Hog 97], for example, all new stars observed by a telescope during the night are processed and added to the database the next day. Part of this processing is to classify the set of new stars, that is to assign each of them to one of the well-known classes. k-nearest neighbor classifiers [Mit 97] are effective for this task and they issue a k-nearest neighbor query for each of the objects to be classified.
In this case, proc_1 is empty. proc_2 finds the majority class in the set of k-nearest neighbors and filter always returns an empty list, that is, no additional query objects are generated.

- Spatial Trend Detection
A spatial trend has been defined as a regular change of one or more non-spatial attributes when moving away from a given start object o [EFKS 98]. Neighborhood paths starting from o model the movement and a regression analysis is performed on the respective attribute values for the objects of a neighborhood path to describe the regularity of change. For this data mining task, the ExploreNeighborhoods loop is additionally controlled by the number of steps (i.e. the length of a neighborhood path) and the procedures proc_1 and proc_2 perform the regression analysis on the paths.

- Proximity Analysis
The goal of proximity analysis is to explain the existence of some cluster of objects by using the features of neighboring objects. [KN 96] presents an algorithm which can efficiently find the top-k objects that are closest to a given cluster. A second algorithm takes these k objects as input and finds the features that are common to most of them. Characteristic properties such as most of the clusters are close to private schools and parks may be discovered. In this case, StartObjects contains all the objects of the specified cluster. proc_2 considers the features of the k-nearest neighbors and returns the most common ones. The filter returns an empty list implying that no additional query objects are added.
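The ExploreNeighborhoods scheme of figure 2 and its density-based instance can be sketched as follows. This is a simplified illustration, assuming one-dimensional points and an in-memory range query; it expands a single cluster only and is not the full DBSCAN algorithm. The helper names (`explore_neighborhoods`, `range_query`, `collect`, `pass_if_dense`) are hypothetical, and proc_2 additionally receives the query object, one of the extra arguments the scheme allows.

```python
from typing import Callable, Iterable, List, Set


def explore_neighborhoods(
    start_objects: Iterable[float],
    similarity_query: Callable[[float], List[float]],
    proc_1: Callable[[float], None],
    proc_2: Callable[[float, List[float]], None],
    filter_answers: Callable[[List[float]], List[float]],
) -> None:
    control_list = list(start_objects)
    seen: Set[float] = set(control_list)  # objects ever put on the control list
    while control_list:                   # simplest condition_check: not empty
        obj = control_list.pop(0)         # choose one object and remove it
        proc_1(obj)
        answers = similarity_query(obj)
        proc_2(obj, answers)
        for a in filter_answers(answers):
            if a not in seen:             # never re-insert: ensures termination
                seen.add(a)
                control_list.append(a)


points = [1.0, 1.5, 2.0, 2.8, 8.0, 8.5]
eps, min_pts = 1.0, 3
cluster: Set[float] = set()
queried: List[float] = []


def range_query(q: float) -> List[float]:
    return [p for p in points if abs(p - q) <= eps]


def collect(obj: float, answers: List[float]) -> None:
    if len(answers) >= min_pts:           # obj is dense enough: grow the cluster
        cluster.update(answers)


def pass_if_dense(answers: List[float]) -> List[float]:
    return answers if len(answers) >= min_pts else []


explore_neighborhoods([1.5], range_query, queried.append, collect, pass_if_dense)
```

Starting from 1.5, the loop issues one range query per reachable object and stops at 2.8, whose neighborhood is too sparse to be expanded further.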

3.3 Transformation into Multiple Query Form

An instance of Iterative Neighborhood Exploration can benefit from a multiple similarity query because the algorithmic scheme can be transformed into a multiple query form such that it uses multiple similarity queries instead of single similarity queries. If the evaluation of a multiple similarity query for m query objects can be performed more efficiently than the evaluation of the corresponding single similarity queries - which is possible as we will see in the next sections - the runtime of the whole class of ExploreNeighborhoods-algorithms will be improved. We assume the following method to be available:

    ListOfAnswerSets DB::multiple_similarity_query(ListOfObjects, SimTypes);

Let the similarity queries for all of the query objects in ListOfObjects be somehow performed simultaneously and let the generated answers for all of these queries be stored in some internal buffer of the DBMS. If each of the queries is completely answered after the call of multiple_similarity_query, successive calls containing queries which were already asked in a previous call of the method then can just pick the answers from the buffer. Note, however, that we do not require the multiple similarity query to generate a complete set of answers for each of the posed queries. One call of a multiple similarity query need only guarantee that the answers for the first query object are complete. We will discuss this weak specification of a multiple similarity query in more detail in the next section. The intuitive meaning is that if we ask a similarity query for the first object, we can additionally inform the DBMS that the similarity queries for the other query objects will probably be asked later and the DBMS may use this information to improve the overall runtime for the set of queries by retrieving (some of) the respective answers in advance. The transformed algorithmic scheme called ExploreNeighborhoodsMultiple is presented in figure 3.
As we can see, the reformulation can be done in a purely syntactical way. Thus, a query optimizer can automatically use multiple similarity queries to efficiently process an ExploreNeighborhoods-algorithm if a multiple similarity query is available as a basic DBMS operation. Obviously, the algorithmic scheme ExploreNeighborhoodsMultiple performs exactly the same task as the original ExploreNeighborhoods scheme. The only differences are that a set of objects is selected from the ControlList instead of selecting a single object and a multiple similarity query is performed instead of a single similarity query. However, in one execution of the main loop, the algorithm processes only the first element of the set of selected objects and its corresponding set of answers.

    ExploreNeighborhoodsMultiple(DB, StartObjects, SimTypes, ...)
        ControlList := StartObjects;
        while ( condition_check(ControlList, ...) = TRUE ) do
            ListOfObjects := ControlList.choose_multiple(...);
            // ListOfObjects = [object_1, ..., object_m]
            proc_1(ListOfObjects.first(), ...);
            SetOfAnswers := DB.multiple_similarity_query(ListOfObjects, SimTypes);
            // SetOfAnswers = [answers_1, ..., answers_m], SimTypes = [SimType_1, ..., SimType_m]
            proc_2(SetOfAnswers.first(), ...);
            ControlList := ( ControlList ∪ filter(SetOfAnswers.first(), ...) ) - {ListOfObjects.first()};
        end while;

Figure 3: Algorithmic scheme ExploreNeighborhoodsMultiple
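The transformed scheme of figure 3 can be sketched in the same style. This is a minimal illustration under assumptions: `BufferedDB` is a hypothetical toy database over one-dimensional points whose multiple query answers every query in the batch completely and keeps the results in a buffer, which is only one of the behaviours the weak specification permits.

```python
from typing import Callable, Dict, Iterable, List, Set


class BufferedDB:
    """Hypothetical toy database with an answer buffer."""

    def __init__(self, points: List[float], eps: float) -> None:
        self.points = points
        self.eps = eps
        self.buffer: Dict[float, List[float]] = {}
        self.evaluations = 0  # range queries that were actually evaluated

    def multiple_similarity_query(self, batch: List[float]) -> List[List[float]]:
        answer_sets = []
        for q in batch:
            if q not in self.buffer:      # not answered by an earlier call
                self.evaluations += 1
                self.buffer[q] = [p for p in self.points
                                  if abs(p - q) <= self.eps]
            answer_sets.append(self.buffer[q])
        return answer_sets


def explore_neighborhoods_multiple(
    db: BufferedDB,
    start_objects: Iterable[float],
    proc_1: Callable[[float], None],
    proc_2: Callable[[float, List[float]], None],
    filter_answers: Callable[[List[float]], List[float]],
    m: int,
) -> None:
    control_list = list(start_objects)
    seen: Set[float] = set(control_list)
    while control_list:
        batch = control_list[:m]          # choose_multiple: up to m objects
        proc_1(batch[0])
        answer_sets = db.multiple_similarity_query(batch)
        first_answers = answer_sets[0]    # only the first query is processed
        proc_2(batch[0], first_answers)
        control_list.pop(0)               # remove only the first object
        for a in filter_answers(first_answers):
            if a not in seen:
                seen.add(a)
                control_list.append(a)


db = BufferedDB([1.0, 1.5, 2.0, 2.8, 8.0, 8.5], eps=1.0)
min_pts = 3
cluster: Set[float] = set()


def collect(obj: float, answers: List[float]) -> None:
    if len(answers) >= min_pts:
        cluster.update(answers)


def pass_if_dense(answers: List[float]) -> List[float]:
    return answers if len(answers) >= min_pts else []


explore_neighborhoods_multiple(db, [1.5], lambda o: None, collect,
                               pass_if_dense, m=2)
```

The loop body differs from the single-query version only in choosing a batch and reading the first answer set, which mirrors the purely syntactical transformation described above. The buffer ensures that an object which appears in several batches is evaluated only once.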

4 Multiple Similarity Queries

In this section, the notion of a multiple similarity query is presented in more detail.

Definition 4: (multiple similarity query) Let $DB \subseteq Objects$ be a database containing n objects. Let $Queries = [Q_1, Q_2, ..., Q_m]$ be a sequence of query objects $Q_i \in Objects$ and let $SimTypes = [T_1, ..., T_m]$ be the corresponding sequence of query types. A multiple similarity query, denoted by DB.multiple_similarity_query(Queries, SimTypes), returns a sequence $Answers = [A_1, ..., A_m]$, containing for each element $Q_i$ in Queries a corresponding set $A_i$ of objects of DB where the following holds:
1.) $A_1 = DB.similarity\_query(Q_1, T_1)$, and
2.) $A_i \subseteq DB.similarity\_query(Q_i, T_i)$ for all $2 \le i \le m$.

Only for the first query, all answers must be determined in a single call of a multiple similarity query. The remaining queries may be answered completely or partially, depending on the implementation of the multiple similarity query. We will argue in the next subsection that an incremental implementation, i.e. only the first query is answered completely and other queries are answered partially in a single call of the method, may be more efficient if we consider the overall run-time of an ExploreNeighborhoodsMultiple algorithm.

Using multiple similarity queries instead of single similarity queries we may spend less I/O time and less CPU time for a set of queries. First, we read a single page only once for the whole set of queries. Second, knowing a whole set of query objects in advance, we can use the distances between these query objects to replace expensive distance computations by significantly cheaper distance comparisons using the triangle inequality. Our algorithm for a multiple similarity query is depicted in figure 4.
The only parts that differ from the algorithm for a single similarity query (see figure 1) - besides the obvious handling of multiple query objects and types - are as follows:

- restore_from_buffer([Q_1, ..., Q_m], [T_1, ..., T_m]) and buffer_answers([Answers_1, ..., Answers_m])
In the beginning, we have to restore (partial) answers from an internal buffer, if available, and we have to store generated answers into this buffer at the end.

- determine_relevant_data_pages([Q_1, ..., Q_m], [T_1, ..., T_m])
This procedure returns the set of all data pages relevant for $Q_1$ and, additionally, it returns some or all of the relevant data pages for the remaining query objects, that is

$relevant\_pages(Q_1) \subseteq determine\_relevant\_data\_pages([Q_1, ..., Q_m], [T_1, ..., T_m]) \subseteq \bigcup_{i=1}^{m} relevant\_pages(Q_i)$.

First, a subset of the set of all queries is chosen which should be completely answered. If the implementation is based on the linear scan, each data page is relevant. If using an index structure such as the X-tree, the set of all data pages which cannot be excluded from the search for at least one of the selected queries is determined from the directory of the tree. Note that our implementation of a multiple similarity query on top of an index structure converges to the method for the linear scan when the page selectivity of the index decreases, e.g. with increasing dimension of the data space. In the worst case, the index has no selectivity at all, which means that no data page can be excluded from the similarity search. The details of determining the relevant data pages are presented in section 5.1.

- avoid_dist_computation(O, Q_i, QObjDists, AvoidingDists)
We calculate the inter-object distances for all pairs of query objects and store them into QObjDists. Distances which must be calculated are temporarily stored into AvoidingDists. These distances and the QObjDists are needed for the application of the triangle inequality performed by avoid_dist_computation(O, Q_i, QObjDists, AvoidingDists). The details of avoiding distance calculations are presented in section 5.2.

    DB::multiple_similarity_query(objects [Q_1, ..., Q_m]; types [T_1, ..., T_m])
        [Answers_1, ..., Answers_m] := Self.restore_from_buffer([Q_1, ..., Q_m], [T_1, ..., T_m]);
        Self.determine_relevant_data_pages([Q_1, ..., Q_m], [T_1, ..., T_m]);
        for i from 1 to m do
            QueryDists_i := T_i.range;
        for i from 1 to m, for j from i+1 to m do
            QObjDists_ij := dist(Q_i, Q_j);
        while Self.unprocessed_pages() do
            NextPage := Self.read_next_page_from_disk();
            for each object O in NextPage do
                for i from 1 to m do
                    AvoidingDists_i := UNDEFINED;
                for i from 1 to m do
                    if Self.page_is_relevant(NextPage, Q_i) then
                        if not avoid_dist_computation(O, Q_i, QObjDists, AvoidingDists) then
                            Distance := dist(O, Q_i);
                            AvoidingDists_i := Distance;
                            if Distance ≤ QueryDists_i then
                                Answers_i.insert(O); // in ascending order of dist(A, Q)
                                if Answers_i.cardinality() > T_i.cardinality then
                                    Answers_i.remove_last_element();
                                QueryDists_i := Self.adapt_query_dist(Distance, QueryDists_i, T_i);
                                Self.prune_pages(QueryDists_i);
        Self.buffer_answers([Answers_1, ..., Answers_m]);
        return [Answers_1, ..., Answers_m];

Figure 4: Algorithm multiple_similarity_query

5 Efficient Support for Multiple Similarity Queries

In this section, we present techniques that significantly reduce the amount of disk I/O as well as the number of CPU operations needed to evaluate a multiple similarity query compared to a set of single similarity queries.
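The I/O effect of the algorithm in figure 4 can be sketched for the linear-scan case with range queries only. This is a simplified illustration under assumptions: pages are in-memory lists, every query in the batch is answered completely, and the answer buffer and the avoidance of distance computations are left out. The function name `multiple_range_query` is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def multiple_range_query(
    pages: Sequence[Sequence[float]],
    queries: Sequence[float],
    eps: float,
    dist: Callable[[float, float], float],
) -> Tuple[List[List[float]], int]:
    answers: List[List[float]] = [[] for _ in queries]
    page_reads = 0
    for page in pages:
        page_reads += 1                   # one read serves the whole query set
        for obj in page:
            for i, q in enumerate(queries):
                if dist(obj, q) <= eps:
                    answers[i].append(obj)
    return answers, page_reads


pages = [[1.0, 4.0], [2.5, 9.0], [3.0]]
queries = [3.0, 9.5]

answers, multi_reads = multiple_range_query(
    pages, queries, 1.0, lambda a, b: abs(a - b))

# Issuing the same queries one at a time scans the database once per query.
single_reads = sum(
    multiple_range_query(pages, [q], 1.0, lambda a, b: abs(a - b))[1]
    for q in queries)
```

With two queries over three pages, the multiple query reads three pages where the single queries together read six.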
Furthermore, we briefly discuss how to achieve a further performance gain when using parallelization techniques for the processing of multiple similarity queries. Note that there is an upper limit for the number m of multiple similarity queries which can be processed simultaneously. This limit is determined by the amount of main memory available to buffer the answers and by the computational overhead for calculating the inter-object

distances between all pairs of query objects. Therefore, we assume that a total number of M similarity queries is processed in $M/m$ consecutive blocks of multiple queries. Let $C^i = C^i_{I/O} + C^i_{CPU}$ be the cost for simultaneously processing i similarity queries. Then, the cost for evaluating M queries using single similarity queries is equal to $M \cdot C^1$, the cost for evaluating M queries using multiple similarity queries is equal to $\frac{M}{m} \cdot C^m$. Consequently, for a multiple similarity query to improve the efficiency of single similarity queries, the following condition must hold: $C^m < m \cdot C^1$.

5.1 Reducing I/O Cost

We discuss the algorithm multiple_similarity_query from an I/O cost point of view for two different implementations - one on top of the linear scan and another one on top of an index structure such as the X-tree.

When performing a linear scan, the multiple similarity query performs a condition check for all query objects while performing a single scan over the database and returns a sequence of answers for each query object. If the dimension d of the data space is very high, the scan may actually be the most efficient method to answer similarity queries because, in general, the performance of index structures degenerates with increasing dimension d.

When using a tree-like index structure (e.g. X-tree) to answer a single similarity query, a set of data pages which cannot be excluded from the search is determined from the directory of the tree. These pages are then examined and the answers to the query are determined. To answer a multiple similarity query for a set $Q = [Q_1, ..., Q_m]$ of query objects, we propose a similar procedure. First, we determine the data pages to be read as if answering only the similarity query for $Q_1$. However, when processing these pages, we do not only collect the answers in the neighborhood of $Q_1$ but we also collect answers for the $Q_i$ ($i = 2, ..., m$) if the pages loaded for $Q_1$ are also relevant for $Q_i$.
After this first step, the query for $Q_1$ is completely finished and the answers for all the other objects are partially determined. To determine the complete answers for the other query objects $[Q_2, ..., Q_m]$ we have to call the method repeatedly for $[Q_2, ..., Q_m]$, $[Q_3, ..., Q_m]$, ..., $[Q_m]$. However, in subsequent calls the partial answers are first restored from the internal buffer. For instance, the second call for $[Q_2, ..., Q_m]$ will only consider data pages which are relevant for $Q_2$ but which have not been processed in the first call. This incremental processing of a multiple similarity query has the advantage that (partial) answers to all of the queries can be presented to a user at a very early stage of the evaluation.

Furthermore, the incremental approach is very efficient if an ExploreNeighborhoods-algorithm dynamically adds new query objects when processing the answers obtained for previous query objects. Let us assume a first call of DB.multiple_similarity_query($[Q_1, ..., Q_m]$, $[T_1, ..., T_m]$). Furthermore, after this call let some answers $A_1, ..., A_k$ for the query object $Q_1$ be inserted into the ControlList of the ExploreNeighborhoods-algorithm and let these objects be inserted into the sequence Q of query objects at the beginning of the second execution of the main loop. Then, the multiple similarity query is executed for $Q = [Q_2, ..., Q_m, A_1, ..., A_k]$, implying that now all data pages are considered which have not been processed for object $Q_1$ but have to be loaded for object $Q_2$. It is very likely for an ExploreNeighborhoods-algorithm that some of these pages must also be considered for some of the objects $A_i$ ($i = 1, ..., k$). Then, the answers for the objects $A_i$ are (partially) collected from the current data pages determined by the object $Q_2$. These pages will not be loaded again when $A_i$ becomes the first element of Q. If we use a non-incremental evaluation of a multiple similarity query we have to load these pages again, resulting in an overall higher number of disk I/Os.

For multiple similarity queries $Q_1, ..., Q_m$ the I/O cost is proportional to $\left|\bigcup_{i=1}^{m} \mathit{relevant\_pages}(Q_i)\right|$, where $|S|$ denotes the cardinality of a set S. Obviously, an I/O speed-up is achieved if (and only if) there are data pages which are relevant for more than one query object, more formally: if

$\left|\bigcup_{i=1}^{m} \mathit{relevant\_pages}(Q_i)\right| < \sum_{i=1}^{m} \left|\mathit{relevant\_pages}(Q_i)\right|$.

In the case of the linear scan, it holds that $C^m_{I/O} = C^1_{I/O}$, because $\mathit{relevant\_pages}(Q_1) = ... = \mathit{relevant\_pages}(Q_m)$, and therefore the condition $C^m_{I/O} < m \cdot C^1_{I/O}$ is obviously satisfied. The average I/O cost for one query object is $C^1_{I/O} / m$ and the speed-up factor for a multiple similarity query compared to single similarity queries (with respect to disk I/O) is exactly equal to m.

In the case of a tree-like index structure, the ratio

$\frac{\sum_{i=1}^{m} \left|\mathit{relevant\_pages}(Q_i)\right|}{\left|\bigcup_{i=1}^{m} \mathit{relevant\_pages}(Q_i)\right|}$

which determines the actual speed-up factor cannot be analytically derived. However, in higher dimensions it is very likely that a data page is relevant for more than one query object, especially if the queries are dynamically generated by an ExploreNeighborhoods-algorithm. Therefore, we assume that the condition $C^m_{I/O} < m \cdot C^1_{I/O}$ is also satisfied in this case, even though we expect the gain of a multiple similarity query on top of a tree-like index to be smaller compared to an implementation for the sequential scan. Note that the performance of a multiple similarity query with respect to the I/O cost is never worse than the performance of a single query.

5.2 Reducing CPU Cost

The basic idea for reducing the CPU cost is to use the triangle inequality to avoid distance computations, which are the most expensive operations when evaluating a similarity query. The proposed approach makes use of the fact that a distance calculation is typically much more expensive than a distance comparison.
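Returning briefly to the I/O condition of section 5.1: the page counts involved can be illustrated with a small Python sketch. It assumes that the set of relevant pages is already known for each query object and that partial answers are buffered so that no page is ever loaded twice; the function names and the page sets are hypothetical.

```python
def incremental_page_loads(relevant_pages):
    """Number of pages newly loaded in each step of the incremental scheme.

    relevant_pages: list of sets of page ids, one set per query object,
    in the order in which the query objects are processed.
    """
    processed = set()
    loads = []
    for pages in relevant_pages:
        new_pages = set(pages) - processed   # pages not seen in earlier steps
        loads.append(len(new_pages))
        processed |= new_pages
    return loads


def io_speedup(relevant_pages):
    """Ratio of the summed page counts to the size of the union of all pages."""
    single = sum(len(set(p)) for p in relevant_pages)
    multiple = sum(incremental_page_loads(relevant_pages))
    return single / multiple


index_like = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]   # overlapping page sets
scan_like = [set(range(10))] * 4                 # every query needs every page
```

For the scan-like input the ratio equals the number of query objects, matching the statement that the I/O speed-up of the linear scan is exactly m; for the index-like input it depends entirely on how much the page sets overlap.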
To apply the triangle inequality to avoid distance calculations, we need to know the distances for each pair of query objects $(Q_i, Q_j)$, which have to be calculated and stored in advance. This computational overhead is (up to a certain value of m depending on the number n of database objects) relatively small compared to the savings of distance computations achieved by such a preprocessing. Intuitively, there are two cases where the calculation of $dist(Q_j, O)$ can be avoided for a query object $Q_j$ and a database object O if we already know the distance between $Q_j$ and a second query object $Q_i$ and the distance $dist(Q_i, O)$ has already been calculated: first, the query objects $Q_i$ and $Q_j$ are close to each other and $dist(Q_i, O)$ is large; second, the query objects $Q_i$ and $Q_j$ have a large distance from each other and $dist(Q_i, O)$ is small. To outline the proposed method more formally, we first define the notion of an avoidable distance calculation in the context of multiple similarity queries.

Definition 5: (avoidable distance calculation) Let $Queries = [Q_1, ..., Q_m]$ be the query objects for a multiple similarity query, $O \in DB$, and let $l$, $1 \le l < m$, be a natural number. Furthermore, let the values of $dist(Q_i, Q_j)$ be known for all $1 \le i \le m$, $1 \le j \le m$, and let $dist(Q_i, O)$ be known for all $1 \le i \le l$. Let $QueryDist(Q_i)$, $1 \le i \le m$, denote the query distance of $Q_i$ in a current execution step of the multiple similarity query. Then, we call the calculation of $dist(Q_{l+1}, O)$ avoidable with respect to Queries if we can conclude that $dist(Q_{l+1}, O) > QueryDist(Q_{l+1})$ without having to calculate $dist(Q_{l+1}, O)$.

To show that a distance calculation is avoidable, we apply the triangle inequality, which is satisfied by the metric distance function dist, to the triangle defined by two query objects $Q_1$ and $Q_2$ and a database object O. We obtain the following three inequalities which hold simultaneously:

(1) $dist(O, Q_1) \le dist(O, Q_2) + dist(Q_2, Q_1)$
(2) $dist(O, Q_2) \le dist(O, Q_1) + dist(Q_1, Q_2)$
(3) $dist(Q_1, Q_2) \le dist(Q_1, O) + dist(O, Q_2)$

Inequality (1) can be used to show the avoidability of the calculation of $dist(Q_2, O)$ because it yields a lower bound for $dist(Q_2, O)$. This is formalized in the following lemma.

Lemma 1. Let $Q_1, Q_2 \in Objects$ be query objects and let $O \in Objects$ be a database object. Let dist be a metric distance function $dist: Objects \times Objects \to R^+$. If $dist(O, Q_1) > dist(Q_2, Q_1) + QueryDist(Q_2)$ holds, then it follows that $dist(Q_2, O) > QueryDist(Q_2)$.

Proof. We reformulate inequality (1) as follows: $dist(O, Q_2) \ge dist(O, Q_1) - dist(Q_2, Q_1)$. By assumption, $dist(O, Q_1) > dist(Q_2, Q_1) + QueryDist(Q_2)$. Then, $dist(O, Q_1) - dist(Q_2, Q_1) > QueryDist(Q_2)$. By exploiting the symmetry of dist we derive: $dist(Q_2, O) \ge dist(O, Q_1) - dist(Q_2, Q_1) > QueryDist(Q_2)$.

Figure 5 (left) illustrates a situation where lemma 1 holds and the calculation of $dist(Q_2, O)$ can be avoided.
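To make lemma 1 concrete, consider a small numeric instance. The three values below are made up purely for illustration and are not taken from the evaluation.

```latex
\begin{align*}
\mathrm{dist}(O, Q_1) &= 10, \qquad \mathrm{dist}(Q_2, Q_1) = 3, \qquad \mathrm{QueryDist}(Q_2) = 2 \\
\mathrm{dist}(O, Q_1) &= 10 > 3 + 2 = \mathrm{dist}(Q_2, Q_1) + \mathrm{QueryDist}(Q_2) \\
\mathrm{dist}(Q_2, O) &\ge \mathrm{dist}(O, Q_1) - \mathrm{dist}(Q_2, Q_1) = 7 > 2 = \mathrm{QueryDist}(Q_2)
\end{align*}
```

Under these assumed values, O lies at least 7 away from $Q_2$, so it cannot be an answer for $Q_2$ and the distance itself never has to be calculated.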
Inequality (2) is not useful for the purpose of avoiding distance calculations because it yields an upper bound and not a lower bound for $dist(Q_2, O)$. Inequality (3), however, can be used analogously to inequality (1) and yields the following lemma.

[Figure 5: Illustration of lemma 1 (left) and lemma 2 (right). $Q_1$, $Q_2$: query objects; O: database object. The legend distinguishes distances calculated in advance, distances calculated for a previous query, and the calculation that can be avoided.]

Lemma 2. Let $Q_1, Q_2 \in Objects$ be query objects and let $O \in Objects$ be a database object. Let dist be a metric distance function $dist: Objects \times Objects \to R^+$. If $dist(Q_2, Q_1) > dist(O, Q_1) + QueryDist(Q_2)$ holds, then it follows that $dist(Q_2, O) > QueryDist(Q_2)$.

Proof. Analogous to the proof of lemma 1.

Figure 5 (right) depicts a case where lemma 2 can be applied to avoid the calculation of $dist(Q_2, O)$. To summarize the two lemmata, figure 6 illustrates the area of database objects O for which the calculation of the distance from a query object $Q_2$ can be avoided. For example, the calculation of $dist(Q_2, O_1)$ can be avoided because $dist(O_1, Q_1) < dist(Q_2, Q_1) - QueryDist(Q_2)$ holds, and the calculation of $dist(Q_2, O_2)$ can also be avoided because $dist(O_2, Q_1) > dist(Q_2, Q_1) + QueryDist(Q_2)$ holds.

[Figure 6: Area of database objects for which calculation of dist can be avoided. $Q_i$: query objects; $O_i$: database objects. The legend distinguishes distances calculated in advance, distances calculated for a previous query, and calculations that can be avoided.]

The CPU cost for processing m multiple similarity queries is given by the following formula

$C^m_{CPU} = \frac{m \cdot (m-1)}{2} \cdot time(dist) + avoiding\_tries \cdot time(comparison) + not\_avoided \cdot time(dist)$

where avoiding_tries denotes the number of (successful or unsuccessful) applications of triangle inequalities and not_avoided denotes the number of distance calculations which actually have to be performed. Obviously, this formula contains several application-dependent parameters which can only be determined experimentally. In the worst case, if no distance calculations can be avoided at all, it holds that $C^m_{CPU} > m \cdot C^1_{CPU}$. However, we observed that $C^m_{CPU}$ is significantly smaller than $m \cdot C^1_{CPU}$ if m is small compared to the database size (see section 6 for details).
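The way lemma 1 and lemma 2 enter the CPU cost formula can be sketched in Python for range queries on top of a linear scan. This is a minimal sketch under several assumptions: the function names, the Euclidean distance, fixed query distances and the choice to count one try per pair of query objects checked are all illustrative, and the code is not the implementation used in the evaluation.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def multiple_range_query_pruned(database, queries, radii, dist=euclidean):
    """Linear scan for m range queries with triangle-inequality pruning."""
    m = len(queries)
    # Distances between all pairs of query objects, calculated in advance.
    qq = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        qq[i][j] = qq[j][i] = dist(queries[i], queries[j])
    stats = {"precomputed": m * (m - 1) // 2, "avoiding_tries": 0, "not_avoided": 0}
    answers = [[] for _ in queries]
    for obj in database:
        known = []  # pairs (i, dist(Q_i, obj)) already calculated for this object
        for j in range(m):
            avoidable = False
            for i, d_io in known:
                stats["avoiding_tries"] += 1
                lemma1 = d_io > qq[i][j] + radii[j]   # Q_i, Q_j close, object far
                lemma2 = qq[i][j] > d_io + radii[j]   # Q_i, Q_j far, object near Q_i
                if lemma1 or lemma2:
                    avoidable = True
                    break
            if avoidable:
                continue                              # obj cannot be an answer for Q_j
            d = dist(queries[j], obj)
            stats["not_avoided"] += 1
            known.append((j, d))
            if d <= radii[j]:
                answers[j].append(obj)
    return answers, stats


queries = [(0.0, 0.0), (10.0, 0.0)]
radii = [1.0, 1.0]
db = [(0.5, 0.0), (10.0, 0.5), (30.0, 0.0)]
answers, stats = multiple_range_query_pruned(db, queries, radii)
```

In this toy input, four of the six possible object-to-query distances are calculated; the other two are excluded by lemma 2 (first object) and lemma 1 (third object). The three counters correspond to the three terms of the cost formula above.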
5.3 Potentials for Parallelization

In this section, we will briefly discuss the implementation of a multiple similarity query on top of a parallel query processor for a shared nothing environment. In such an environment, the data is distributed among s servers such that the same similarity query is performed on each server in parallel. However, each process has to look only at its local part of the data, which is s times smaller than the whole database. The communication overhead in this setting is very small so that a speed-up (compared to a sequential implementation) in the order of s can be expected, i.e. the cost for performing m multiple similarity queries is reduced to $C^m / s$. The implementation of such a parallel query processor is trivial for the linear scan. For a parallel implementation of, for example, the X-tree see [Ber+ 97].

The transition from one computer to s computers of the same type also makes s times the main memory available. If we use these additional resources when performing multiple similarity queries in parallel, we can gain a remarkable speed-up factor for $M \ge m \cdot s$ similarity queries which is larger than the number s of machines. This effect is due to the fact that we can increase the number of query objects to be processed simultaneously if we have more memory to buffer the answers. In a parallel environment each process produces only one s-th of the answer set for a query object on the average. Therefore, instead of evaluating M similarity queries in blocks of m queries on a single machine we can now use blocks of $m \cdot s$ queries. This means that the cost for evaluating M queries using parallel multiple similarity queries is equal to $\frac{M}{m \cdot s} \cdot \frac{C^{m \cdot s}}{s}$ compared to $\frac{M}{m} \cdot C^m$ for the sequential implementation. Consequently, the speed-up factor for a parallel multiple query versus a sequential multiple query is larger than s if $C^{m \cdot s} < s \cdot C^m$ holds. From sections 5.1 and 5.2 we know that we can expect this condition to be satisfied at least for the I/O cost. We cannot prove that the condition $C^{m \cdot s}_{CPU} < s \cdot C^m_{CPU}$ holds for the CPU cost but section 6.2 demonstrates this experimentally. Note, however, that even if this condition did not hold, we would still have the normal speed-up factor of s when using parallelization.

6 Performance Evaluation

We performed an extensive experimental evaluation of our technique for multiple similarity queries using real databases.
The first database, part of the so-called Tycho catalogue [Hog 97], was provided by the European Space Agency (ESA) and contains 20-d feature vectors of 1,000,000 stars and galaxies. The second dataset was a large image database containing 64-d color histograms of 112,000 images from TV snapshots. We investigated two extreme instances of iterative neighborhood exploration discussed in section 3.2, one with independent queries and another one with highly dependent queries:

On the Astronomy database, we tested simultaneous classification of a set of objects. M objects from the database were chosen randomly and a k-nearest neighbor query was performed for each of these query objects.

On the image database, we simulated manual data exploration by a number of c concurrent users in the following way. We randomly selected a first query object for each of the users and performed a k-nearest neighbor query for each of them, obtaining a total of $c \cdot k$ answers. Then we performed the following loop. While each of the hypothetical users chose one from his k current answers, for each of the current answers we prefetched their k-nearest neighbors. After restricting the set of answers to the answers of the objects chosen by the users, we continued the loop with these new query objects etc. Thus, in each loop we generated $m = c \cdot k$ new query objects for which we performed k-nearest neighbor queries.
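The exploration loop used on the image database can be paraphrased in a short Python sketch. Everything concrete here is an assumption for illustration: the brute-force k-nearest neighbor search, the random choice standing in for a user's selection, the synthetic 2-d points, and all function names. The sketch issues the prefetch queries one by one, whereas the experiment described above would submit each batch as one multiple similarity query.

```python
import heapq
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(database, query, k):
    """Brute-force k-nearest neighbors (stand-in for the real query processor)."""
    return heapq.nsmallest(k, database, key=lambda o: euclidean(query, o))


def simulate_exploration(database, c, k, rounds, rng):
    """Return the number of query objects generated in each loop iteration."""
    # One random start object per user, each answered by a k-NN query.
    current = [knn(database, q, k) for q in rng.sample(database, c)]
    batch_sizes = []
    for _ in range(rounds):
        # Prefetch the k-NN of every current answer: c * k query objects.
        batch = [o for user_answers in current for o in user_answers]
        batch_sizes.append(len(batch))
        prefetched = {o: knn(database, o, k) for o in batch}
        # Each user picks one of its k answers; keep only those answer sets.
        current = [prefetched[rng.choice(user_answers)] for user_answers in current]
    return batch_sizes


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(50)]
sizes = simulate_exploration(points, c=3, k=4, rounds=2, rng=rng)
```

Each iteration produces c times k query objects, here 12, which is the block size m handed to the multiple similarity query in this setting.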

We experimented with a broad range of k values and found that the average cost per k-nearest neighbor query was quite robust to the value of k. All the results reported in the following were obtained for k = 10 (Astronomy database) and k = 20 (image database), which are typical parameter values for the respective applications. All experiments were performed on Intel Pentium II (300 MHz) based workstations running Linux 6.0, each workstation equipped with 128 MBytes of main memory. Both the linear scan and the X-tree were implemented in C++. The block size of the X-tree was set to 32 KBytes and the buffer size was set to 10% of the X-tree size.

6.1 Reduction of I/O Cost

We begin by studying the effect of our technique for multiple similarity queries on the I/O cost. Figure 7 depicts the average I/O cost per similarity query with respect to the number m of multiple similarity queries for the Astronomy database as well as for the image database. For a single similarity query, the X-tree outperforms the linear scan by a factor of 4.5 and 3.1. For m = 100 query objects, however, the average I/O cost of the X-tree is 1.5 and 3.6 times the average I/O cost of the linear scan. While the enormous reduction of I/O cost (a factor of nearly m) is expected for the linear scan, it is worth noticing that the average I/O cost of the X-tree is also reduced by a factor of 8.7 and 15 for 100 multiple similarity queries.

[Figure 7: Average I/O cost per similarity query. Plot of average I/O cost (sec) versus number of multiple similarity queries (m) for SCAN and X-tree on the Astronomy DB and the image DB.]

6.2 Reduction of CPU Cost

The amount of CPU cost which can be saved when a data object is disqualified on the basis of the triangle inequality depends on the dimensionality of the database, since the CPU cost for a distance calculation increases with the dimensionality whereas the CPU cost for evaluating the triangle inequality is constant. We measured the following average runtimes on our test databases.
On 20-d data objects the CPU cost for calculating the Euclidean distance (4.3 µsec) was 52 times the CPU cost for evaluating a triangle inequality (0.082 µsec), and on the 64-d data objects the factor was 155 (12.7 µsec versus 0.082 µsec). We measured the average CPU cost per query for 10, 20, 40, 50 and 100 multiple similarity queries (cf. figure 8). For the linear scan, the average CPU cost for a similarity query decreases from 4.3 sec to 0.6 sec on the Astronomy database when increasing m from 1 to 100. This corresponds to a reduction of the CPU cost by a factor of 7.1. On the image database, the factor of the CPU cost reduction is even 28. This effect can be explained when considering the distribution of the databases: the Astronomy database is almost uniformly distributed, whereas the image database is highly clustered. The linear scan profits from clustered databases for the following reason: if the distance computation for one data object from a cluster can be avoided, it is likely that the distance computation for all other data objects lying in the same cluster can also be avoided.

[Figure 8: Average CPU cost per similarity query. Plot of average CPU cost (sec) versus number of multiple similarity queries (m) for SCAN and X-tree on the Astronomy DB and the image DB.]

For the X-tree, the effect of applying the triangle inequality is smaller than for the linear scan: the reduction factor is 2.1 on the Astronomy database as well as on the image database. The reason for this smaller performance gain of the X-tree is the fact that, due to its indexing properties, the X-tree solely investigates data objects which are close to query objects. Since data objects which have a large distance to the query objects (and therefore a high probability to be excluded from the distance calculation for most of the query objects) are not considered, the potential for CPU cost reduction is less than for the linear scan.

6.3 Reduction of the total query cost

We now consider the effect of our technique for multiple similarity queries on the total query cost and determine the achieved speed-up. For both databases, figure 9 shows the average total query cost as the sum of the average I/O cost and the average CPU cost. This can be done since the cost for managing the query process can be neglected compared to the I/O cost and CPU cost.

[Figure 9: Average total query cost per similarity query. Plot of average total query cost (sec) versus number of multiple similarity queries (m) for SCAN and X-tree on the Astronomy DB and the image DB.]

As expected, the average total query cost decreases with increasing m for the linear scan and the X-tree. An important observation we made is that for $m \ge 20$ (Astronomy database) and $m \ge 100$ (image database) the total query cost is dominated by the CPU cost when performing a linear scan. The average query cost of the X-tree was I/O bound for $m \le 100$. Since the performance gain is higher for the linear scan, the linear scan outperforms the X-tree for $m \ge 10$ (Astronomy database) and $m \ge 100$ (image database).

Figure 10 depicts the corresponding speed-up. When comparing m = 100 to m = 1, the linear scan achieves a speed-up of 28 on the Astronomy database and 68 on the image database. For the X-tree, this speed-up is smaller due to the smaller benefits from the triangle inequality and the smaller reduction of I/O cost. However, we still observe a speed-up of 7.2 on the Astronomy database and 12.1 on the image database. Note that the speed-up factors are always higher on the image database. Similar to section 6.2, this effect can be explained with the distribution of the databases.

[Figure 10: Speed-up with respect to m. Plot of speed-up versus number of multiple similarity queries (m) for SCAN and X-tree on the Astronomy DB and the image DB.]

6.4 Effects of parallelization

We also investigated the achievable speed-up when applying our technique for multiple similarity queries on top of a parallel query processor. The setting we used was a shared nothing environment with a TCP/IP network interconnecting 16 servers. For the implementation details of a parallel X-tree see [Ber+ 97]. For both databases, we performed m = 100 multiple k-nearest neighbor queries on a single server, and while we increased the number of servers (s = 4, 8, 16) we proportionally increased m (m = 400, 800, 1600). Our technique of parallelization increases m in order to exploit the fact that s times the main memory becomes available (see section 5.3).
Figure 11 depicts the achieved speed-up per similarity query comparing parallel multiple similarity queries to sequential multiple similarity queries. On the Astronomy database, the parallel linear scan achieves a superlinear speed-up using up to 8 servers and a near linear speed-up of 13.4 using 16 servers. For larger server numbers, i.e. also larger numbers of queries, two effects decrease the speed-up: (1) the cost for the computation of the distance for each pair of query objects and (2) the cost for applying the triangle inequalities for each database object which, in the worst case, is also quadratic in m. The second effect is less important for the X-tree because the X-tree visits only a considerably
