Evaluating Continuous Nearest Neighbor Queries for Streaming Time Series via Pre-fetching


Like Gao, Zhengrong Yao, X. Sean Wang
Department of Information and Software Engineering, George Mason University
Mail Stop 4A4, 4400 University Drive, Fairfax, VA
{lgao, zyao,

ABSTRACT

For many applications, it is important to quickly locate the nearest neighbor of a given time series. When the given time series is a streaming one, nearest neighbors may need to be found continuously at all time positions. Such a standing request is called a continuous nearest neighbor query. This paper seeks fast evaluation of continuous queries on large databases. The initial strategy is to use the result of one evaluation to restrict the search space for the next. A more fundamental idea is to extend the existing indexing methods, used in many traditional nearest neighbor algorithms, with pre-fetching. Specifically, pre-fetching is to predict the next value of the stream before it arrives, and to process the query as if the predicted value were the real one, in order to load the needed index pages and time series into the allocated cache memory. Furthermore, if the pre-fetched candidates cannot fit into the cache memory, they are stored in a sequential file to facilitate fast access to them. Experiments show that pre-fetching improves the response time greatly over the direct use of traditional algorithms, even if the caching provided by the operating system is taken into consideration.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems - Query processing

General Terms: Algorithms, Experimentation, Performance

Keywords: Streaming time series, nearest neighbor, continuous query

1. INTRODUCTION

Finding the nearest neighbor of streaming time series can be useful in many applications ranging from sensor monitoring to automated online stock analysis.
In these applications, a large number of time series, called pattern series, are stored in a database, and the input to be monitored or analyzed takes the form of a streaming time series. At each time position, the system must take the current time series, formed by using the most recent values from the stream, and locate the pattern series in the database that is closest to this current series. This standing request throughout all the time positions is called a continuous nearest neighbor query of streaming time series. When the pattern database is large and the stream data come in fast, the challenge is how to evaluate such a continuous query efficiently, especially in terms of response time. For example, in many control systems, it is critical to quickly recognize and detect events from the incoming sensor data, which arrive in the form of time series, and to provide situation awareness for the systems to make smart decisions and react quickly. In general, when the number of pattern series is large and the data must be stored in secondary storage, we can use multi-dimensional index structures to accelerate the search process. Prior research has obtained excellent results in this area [1, 8, 2, 5], providing algorithms that greatly outperform the naive sequential scan method. However, the traditional algorithms may not provide good enough solutions when the volume of patterns is relatively large.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM '02, November 4-9, 2002, McLean, Virginia, USA. Copyright 2002 ACM .../02/11...$5.00.
An example is when the query processing unit is embedded in a small component deployed in the field. In such situations, the use of an index may involve many disk accesses and still require a long processing time. Treating the continuous query at different time positions as independent queries is obviously not the best we can do. In this paper, we study strategies that exploit the characteristics of continuous queries to achieve better performance, especially shorter response time. Our starting point is the traditional time series nearest neighbor algorithms. The basic approach of the traditional algorithms is to use a certain mathematical transform to map the pattern time series of the database into low dimensional approximations, each time series being mapped to a point in a low dimensional feature space. These approximations guarantee that the distance between two feature points is no greater than (and hopefully close to) the distance between their corresponding time series. Suppose the distance (called the threshold) between a querying time series and a particular pattern series is known. Obviously, a pattern series cannot be the nearest neighbor if its distance lower bound to the querying series (obtained from the low dimensional approximation) is greater than this threshold. All other pattern series, however, need to be considered further, and are thus called candidate series. The choice of the approximation and this

particular pattern series is important for the effectiveness of the algorithm. We follow the above traditional strategy and use special properties of the continuous query. Consider two successive evaluations of the continuous nearest neighbor query. If the successive values of the streaming time series mostly do not change abruptly, then the two time series used in these two evaluations (called querying series) are similar to each other. Therefore, in many cases, the nearest neighbor of the first querying series should be close to that of the second querying series. This continuity can be exploited by evaluation algorithms that may use the distance between the second querying series and the nearest neighbor of the first querying series (i.e., the result of the first evaluation). When the continuity is strong, this distance is likely to exclude many pattern series from being considered as candidate nearest neighbors for the second querying series. The continuity property also says that the index pages and the candidates accessed by one evaluation are likely to be accessed again in the next evaluation. This property is no stranger to us. In fact, almost all operating systems and database systems use caching to save disk pages in the cache/buffer memory for future operations. In our continuous queries, this strategy works especially well when the continuity is strong, since the same index pages and candidate time series are likely to be accessed again. The above continuity property can be used directly within the framework of the traditional algorithms. A departure from the traditional algorithms is that we use another important property of the continuous query. This property, called predictability, lies in the fact that in many applications, the values in the streaming time series can be predicted quite well at many, if not most, time positions.
In this case, the next querying time series can be predicted and the next evaluation can be attempted even before the next time value arrives. The usefulness of such a dry run is that we may access the index pages and candidate time series for this predicted querying time series, and cache them. When the actual value comes and the next evaluation must be performed, the index pages and candidates are already in the cache and the response time is shortened dramatically. We call this the pre-fetching strategy. In addition, during the above dry run, we may not be able to fit all the pre-fetched index pages and candidate series into the cache memory. When this happens, we store these candidates back to disk. This time, however, since we know they will very likely be accessed when the actual value arrives, we use a sequential file. The benefit, of course, is that access to this sequential file is much faster than the random access that we would otherwise have to do. This reduces the response time of the continuous query when the available memory for the cache is small. To verify the effectiveness of the above strategies, we perform experiments on some stock market data and a large volume of synthetic random walk data, assuming that caching is always available. We compare the use of continuity versus that without using it, and then the use of pre-fetching against that without it (but both with the continuity property used). All the strategies work well.

Symbol    Comments
S         streaming time series (querying series).
S[i:j]    subseries of streaming S between positions i and j, inclusive.
T_p       querying time series of streaming S at position p, T_p = S[p-L+1 : p].
T̂_p       the predicted querying time series of T_p.

Table 1: Some frequently used symbols.

The remainder of the paper is organized as follows. In Section 2, we present our basic assumptions and definitions. In Section 3, we outline our algorithms, and in Section 4, we present our experimental results. We compare with related
work in Section 5, and then conclude with some discussion.

2. CNNQ FOR STREAMING TIME SERIES

In this section, we first introduce the basic concepts and definitions of continuous nearest neighbor queries (CNNQ), and then discuss some properties of streaming time series and how to use them to improve the performance of CNNQ.

Definition. A streaming time series, denoted S, is an infinite real number sequence whose values are obtained by sampling the underlying process at a fixed time interval. Without loss of generality, we assume the first value of S is sampled at time position 0, and the (i+1)-th value is sampled at time i. If we use S[i] to denote the sampled value at time position i, then S = S[0], ..., S[i], .... A subseries of S from time position i to j, inclusive, is a finite series of j - i + 1 values, and is denoted S[i : j].

The sampled data arrive at the query processing system sequentially. They may come in at the same rate, i.e., at the same speed as the sampling rate, or could arrive at variable speeds. To simplify the illustration, in this paper we assume that the data arrive at the database system at the same speed, and further, that the query processing is fast enough to finish the current evaluation before the next value comes. In our continuous query, given an integer length L, at each time position p, where p >= L - 1, the most recent L values will be used to form a subseries, namely S[p-L+1 : p], and the system is to find the nearest neighbor of S[p-L+1 : p] from a database of time series. Here, we use the Euclidean distance to measure the closeness of two time series x and y of length L: D(x, y) = sqrt(sum_{i=0}^{L-1} (x[i] - y[i])^2).

Definition. Let L > 0 and p >= L - 1 be integers. Let S be a streaming time series. Given a database of N pattern time series O_0, O_1, ..., O_{N-1}, each of which is of length L, pattern O_i is said to be the nearest neighbor of S at position p if for all other O_j, j != i, D(S[p-L+1 : p], O_i) < D(S[p-L+1 : p], O_j).
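As a concrete illustration of these definitions, the following sketch computes the Euclidean distance and the nearest neighbor at one position. The toy stream, toy pattern database, and helper names are ours, not the paper's:

```python
import math

def euclidean(x, y):
    # D(x, y) = sqrt(sum_i (x[i] - y[i])^2), as in the definition above.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbor(querying_series, patterns):
    # Return the index i of the pattern O_i minimizing D(T_p, O_i).
    return min(range(len(patterns)),
               key=lambda i: euclidean(querying_series, patterns[i]))

# Querying series T_p = S[p-L+1 : p] (inclusive in the paper's notation,
# hence the p+1 in the Python slice), with window L = 4 at position p = 5.
S = [1.0, 2.0, 3.0, 2.5, 2.0, 1.5]
L, p = 4, 5
T_p = S[p - L + 1 : p + 1]
patterns = [[3.0, 2.5, 2.0, 1.5], [0.0, 1.0, 2.0, 3.0]]
print(nearest_neighbor(T_p, patterns))  # -> 0: pattern 0 matches T_p exactly
```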
We call the time series S[p-L+1 : p] the querying series at position p and denote it by T_p, when L is understood. Table 1 summarizes the most frequently used symbols.

Definition. Given an integer L > 0, the continuous nearest neighbor query, or CNNQ for short, of a streaming time series S is a standing request that asks for the nearest neighbor of S at each time position p, p >= L - 1, from the database.

In this paper, we use the average response time to measure the performance, and define it as follows.

Definition. Given an integer L > 0 and a streaming time series S, the response time of the continuous nearest neighbor

query, denoted t_r, is the average response time to find one of the nearest neighbors of S:

    t_r = (1/n) * sum_{p=L-1}^{L-2+n} t_{r_p},

where t_{r_p} is the response time to find the nearest neighbor of S at time position p, and n is the number of time positions considered.

When there is a large number of time series patterns in the database, the time cost will mostly involve access to the disk, since our computation is mostly straightforward. We assume the index and all the time series are on the disk, and we measure the number of disk pages accessed in response to a query. As in realistic situations, we further assume that a fixed memory space is reserved for the continuous queries. This fixed memory space is called the cache [1].

To reduce the response time of the CNNQ for streaming time series, it is important to exploit the characteristics of the streaming time series. Compared to queries over a group of unrelated time series, many streaming time series display the properties called predictability and continuity.

Predictability means that in many applications, the next value in the streaming time series can be predicted. We can use this prediction to make preparations before the value comes. This strategy can reduce the response time of queries. Note that this prediction need not be precise every time. Indeed, if there are enough time positions at which the prediction is precise, we will win in overall response time.

[Figure 1: Continuity shown in few changes of nearest neighbor. (Stock data, 10^5 patterns; x-axis: pattern length; y-axis: percentage of positions where the nearest neighbor changes.)]

Continuity says that the stream is relatively smooth, i.e., one value of the time series is not too far away from the previous one. With this continuity, strong correlation may exist between successive querying time series. This similarity between successive querying time series leads to similarity between their nearest neighbors. We performed some experiments to see this continuity in real stock market data.
For each time position p, we look at whether the nearest neighbor at position p is the same as that at the previous position. Figure 1 shows the percentage of positions where a new nearest neighbor appears. Not surprisingly, the percentage goes down as the length L increases, since the greater L is, the less influence each value has on the choice of the nearest neighbor. Also, when the length L is about 100, there is only a small percentage of positions where a new nearest neighbor appears [2]. For example, when we use a pattern length of 128, only at 2% of the time positions does a new nearest neighbor appear. Thus, many times, the answer of the continuous query remains the same.

[1] Another term for this is buffer.
[2] This of course only roughly indicates the continuity. Indeed, even if a new nearest neighbor appears, the new nearest neighbor may be very similar to the previous one. This is not shown in the figure.

3. ALGORITHMS FOR CNNQ

As mentioned in the introduction, our algorithms are based on the traditional indexing methods. In this section, we first review the traditional nearest neighbor search algorithms, and then give a CNNQ algorithm which is a direct extension of the traditional approaches. After that, we present the algorithm that uses both prediction and pre-fetching to achieve faster response time. For comparison purposes, we also introduce the sequential scan method.

3.1 Traditional algorithms

Most of the state-of-the-art nearest neighbor search algorithms are based on dimensionality reduction and indexing techniques. The dimensionality reduction uses some mathematical transform (e.g., SVD, DFT, DWT, APCA) [14] and keeps part of the coefficients from the transformation of a time series. These selected coefficients form the feature of the given time series, and each time series then has a representative in the feature space.
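As a concrete sketch of such a transform-and-truncate feature map, an orthonormal Haar wavelet transform (one of the DWT options mentioned above) can be cut to its first few coefficients; because the full transform preserves Euclidean distances, dropping coefficients can only shrink them. The implementation and toy data are our own illustration:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def haar(series):
    # Orthonormal Haar DWT; the length must be a power of two.
    out = list(series)
    n = len(out)
    detail = []
    while n > 1:
        avg = [(out[2*i] + out[2*i+1]) / math.sqrt(2) for i in range(n // 2)]
        dif = [(out[2*i] - out[2*i+1]) / math.sqrt(2) for i in range(n // 2)]
        detail = dif + detail          # coarse coefficients stay in front
        out = avg
        n //= 2
    return out + detail

def feature(series, k):
    # Keep the first k coefficients as the feature vector.
    return haar(series)[:k]

x = [1.0, 3.0, 5.0, 7.0]
y = [2.0, 2.0, 8.0, 6.0]
# Truncated feature distance never exceeds the true distance.
print(euclidean(feature(x, 2), feature(y, 2)) <= euclidean(x, y))  # -> True
```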
Most of these transforms guarantee that the distance of any two time series is no less than their distance in the feature space. A high-dimensional index is built on the feature points in the feature space using an R-tree, X-tree, KDB-tree, etc. [12, 4]. When a query is issued, it takes several steps to find the nearest neighbor of the querying series [15, 22]. The major steps of such an algorithm are illustrated in Figure 2.

Step  Action
1. (Nearest neighbor search in feature space): transform the querying series T to f(T) in the feature space and issue the nearest neighbor search to find the sub-optimal or optimal nearest neighbor of f(T), denoted NN_f;
2. (Determining the threshold): calculate the real distance from the querying series T to NN_f in the original space, denoted TH_range;
3. (Range query): use TH_range as the range to find all the features whose distances to f(T) are no greater than TH_range; the time series found are called candidates;
4. (Verification): evaluate the actual distance from each candidate to the query T in the original space, and find the actual nearest neighbor.

Figure 2: Traditional multi-step algorithm.

The first step of the multi-step nearest neighbor search algorithm is costly if the optimal NN_f in the feature space is produced, especially when the number of patterns is large and the dimensionality of the feature space is high. A sub-optimal nearest neighbor may be found at reduced cost [21]. Since the costs of Steps 3 and 4 are determined by the threshold TH_range obtained from Step 2, TH_range should be chosen as small as possible. One method to reduce the threshold, called the optimal multi-step algorithm, is given in [22]. The method dynamically reduces the threshold by sorting all the pattern series according to their distances to f(T) in the feature space, and then incrementally fetching the objects.
The threshold is dynamically reduced at each step, if possible, to reduce the candidates without calculating the actual distances.
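The four steps of Figure 2 can be sketched as a simplified in-memory routine. This is our own sketch: there is no index, f stands for any lower-bounding feature map, and here we use a plain coordinate projection, which can only shrink Euclidean distances:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def multi_step_nn(T, patterns, f):
    # Step 1: nearest neighbor search in feature space, giving NN_f.
    fT = f(T)
    nn_f = min(range(len(patterns)),
               key=lambda i: euclidean(fT, f(patterns[i])))
    # Step 2: its real distance becomes the threshold TH_range.
    th_range = euclidean(T, patterns[nn_f])
    # Step 3: range query -- keep patterns whose feature distance (a lower
    # bound on the real distance) does not exceed the threshold.
    candidates = [i for i in range(len(patterns))
                  if euclidean(fT, f(patterns[i])) <= th_range]
    # Step 4: verify the candidates with real distances.
    return min(candidates, key=lambda i: euclidean(T, patterns[i]))

patterns = [[0.0] * 4, [1.0] * 4, [5.0] * 4]
T = [1.0, 1.0, 1.0, 0.5]
proj = lambda s: s[:2]   # dropping coordinates never increases the distance
print(multi_step_nn(T, patterns, proj))  # -> 1 (the pattern [1, 1, 1, 1])
```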

3.2 The Direct-Index algorithm

As mentioned in Section 2, due to continuity, the nearest neighbor at the previous time position is likely to be close to the nearest neighbor at the current position. This is especially true with long patterns. Therefore, the distance from the current querying series to the nearest neighbor at the previous position is likely to be the smallest threshold, and it may be even smaller than the optimal distance found by the above optimal multi-step algorithm. With this observation, we modify the traditional multi-step algorithm by taking the nearest neighbor at the previous position into consideration. Specifically, in step 2 of Figure 2, we calculate the distance from the previous nearest neighbor to the current querying series, and choose the smaller of this distance and TH_range as the new threshold. This value is used in step 3. The experiments in Section 4.1 show that this modification significantly reduces the search space of the range query.

In addition to the above strategy of using continuity, we also use caching to further speed up the evaluations. Again, as mentioned in Section 2, due to the continuity property, we can expect that two successive evaluations share some of the search space and candidate sets. So it is advantageous to keep the accessed index pages and candidates in main memory. This kind of caching is naturally achieved by the underlying operating system. In our case, we can assume that the operating system always allocates a fixed amount of memory to cache these pages and candidates. The cache memory can hold not only the index pages and candidates of one evaluation, but also those of previous ones, as long as they fit in the allocated memory. A page replacement algorithm, e.g., the Least-Recently-Used (LRU) algorithm, can be used to manage the cache memory.
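A fixed-size LRU cache of the kind described above can be sketched in a few lines. This is a stand-in of our own for the operating system's page cache, with hypothetical page identifiers:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()      # page_id -> page data

    def get(self, page_id):
        # A hit refreshes the page's recency; a miss returns None (page fault).
        if page_id not in self.pages:
            return None
        self.pages.move_to_end(page_id)
        return self.pages[page_id]

    def put(self, page_id, data):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used

cache = LRUCache(2)
cache.put("index:1", "...")
cache.put("index:2", "...")
cache.get("index:1")            # touch page 1, so page 2 becomes LRU
cache.put("index:3", "...")     # evicts page 2
print(cache.get("index:2"))     # -> None, i.e., a page fault
```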
Since these resources are cached by the operating system and the query evaluation process is not directly involved in the cache management, this method can be viewed as a passive caching strategy. Combining the two strategies, we summarize this extended algorithm as the Direct-Index algorithm shown in Figure 3.

At each time position p of the streaming S, perform the following steps:
1. Form the querying series T_p = S[p-L+1 : p].
2. Find the nearest neighbor of T_p as follows:
   2.1 Do the nearest neighbor search as step 1 in Figure 2.
   2.2 Determine the threshold as follows:
       (a) Get TH_range as step 2 in Figure 2;
       (b) Find the real distance, d_pre, from T_p to the nearest neighbor at the previous position p-1;
       (c) Update the threshold: TH_range = min(TH_range, d_pre).
   2.3 Get candidates as step 3 in Figure 2.
   2.4 Find the nearest neighbor as step 4 in Figure 2.
3. Report the nearest neighbor.
4. Insert the index pages and candidates of this evaluation into the cache memory.

Figure 3: The Direct-Index algorithm.

Compared with the traditional nearest neighbor algorithms, the Direct-Index algorithm needs less time to find the nearest neighbor of the streaming S at each position. One reason is the reduced search space; the other is the cached index pages and candidate pattern series. As a result, both the processing time and the response time are shorter.

3.3 The Pre-fetch algorithm

The Direct-Index algorithm exploits the continuity of streaming time series to reduce the processing time. We may use the other property, namely predictability, to further reduce the response time, even at time positions where continuity is not strong. This is the basis for the Pre-fetch algorithm. As mentioned in Section 2, at many time positions, the next value of the streaming series can be predicted reasonably well. This provides the opportunity to obtain fast response time by using the idle time before the actual value arrives.
Instead of caching the resources used by the current evaluation, we can load into memory the index pages and candidate series that are more likely to be needed by the next evaluation. In order to do so, we perform the Direct-Index algorithm on the predicted querying time series. All the index pages and candidate series accessed in this dry run are potential targets to be cached and used when the actual value arrives and the real evaluation is performed. Hence, the preparation step serves to load the needed index pages and candidate series. This will be most useful when the continuity of the series is weak, i.e., when two successive querying series are not similar to each other. Indeed, in this case, the index pages and candidate series used by the previous evaluation may not be useful for the next one, while pre-fetching will load the correct (based on prediction) index pages and candidate series. Since this pre-fetching method actively loads useful resources into the cache, it can be viewed as an active caching strategy.

Consider the streaming time series S. At the current time position p, the first L-1 values of the next querying series T_{p+1} have already arrived by our assumption. We only need the next value S[p+1] to form the entire querying series. In most cases, the value S[p+1] can be predicted with real-world models and statistical inference; the prediction of S[p+1] is denoted Ŝ[p+1]. With this prediction, we can form the predicted querying series at the next position: T̂_{p+1} = S[p-L+2], ..., S[p], Ŝ[p+1]. The simplest prediction of S[p+1] is to let Ŝ[p+1] = S[p], i.e., use the previous value as the prediction of the next one. We can easily show the following:

Proposition 1. Assume that the prediction of the streaming value at position p+1 is Ŝ[p+1] = S[p], and the predicted time series of T_{p+1} is T̂_{p+1} = S[p-L+2], ..., S[p], Ŝ[p+1]. Then D(T_{p+1}, T̂_{p+1}) <= D(T_{p+1}, T_p).
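Proposition 1 can be checked numerically on a toy stream of our own: D(T_{p+1}, T̂_{p+1}) reduces to |S[p+1] - S[p]|, which is one of the terms inside D(T_{p+1}, T_p):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

S = [1.0, 1.2, 0.9, 1.5, 1.4, 1.7]       # toy stream
L, p = 3, 4                               # window length, current position

T_p    = S[p - L + 1 : p + 1]             # current querying series
T_next = S[p - L + 2 : p + 2]             # actual next querying series T_{p+1}
T_hat  = S[p - L + 2 : p + 1] + [S[p]]    # prediction with S_hat[p+1] = S[p]

# D(T_{p+1}, T_hat) = |S[p+1] - S[p]|, one term inside D(T_{p+1}, T_p),
# hence the inequality of Proposition 1.
print(euclidean(T_next, T_hat), euclidean(T_next, T_p))
```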
The proof is straightforward from the definition of the Euclidean distance. This result means that even with this simplest prediction, the predicted querying series is closer to the next querying series than the current querying series is. Therefore, it makes more sense to use the predicted querying series to load the index pages and candidate series than to use the previous querying time series. When an application has a precise prediction of the values in the stream, it will be even more beneficial than the simple method given here. Once we have the predicted querying series, we can use it to fetch the index pages and candidates by evaluating a nearest neighbor query and a range query, i.e., step 2 in Figure 3. These pages will stay in the cache memory to be used when the actual value arrives.

For large databases, the pre-fetched index pages and candidates may be too large to fit in the cache memory, while these pages are very likely to be used soon. A solution is to take advantage of sequential files for the overflowed candidate series. When pre-fetched candidates cannot fit in the cache, we store them in a sequential file. Once the next evaluation begins, this file is read in sequentially as the first step of the evaluation, to calculate the distances between the querying series and these candidates.

At each position p of a streaming time series:
1. Find the nearest neighbor of S at position p:
   1.1 Form the querying series T_p = S[p-L+1 : p].
   1.2 Find the nearest neighbor of T_p:
       (a) Do the nearest neighbor search in feature space as step 1 in Figure 2.
       (b) Determine the threshold as follows:
           (1) Get TH_range as step 2.2 in Figure 3;
           (2) Get the smallest distance d_cad from T_p to the candidates cached by the previous evaluations:
               - if the candidate is in the cache memory, use it directly;
               - otherwise, read it from the sequential file.
           (3) Update the threshold: TH_range = min(TH_range, d_cad).
       (c) Get the candidates as step 3 in Figure 2.
       (d) Exclude the candidates that are used in 1.2(b.2), except for the one that yields d_cad.
       (e) Find the nearest neighbor as step 4 in Figure 2.
   1.3 Report the nearest neighbor.
2. Pre-fetch for the next evaluation at position p:
   2.1 Predict the data at position p+1, Ŝ[p+1], and form the predicted querying series T̂_{p+1} = S[p-L+2], ..., S[p], Ŝ[p+1];
   2.2 Pre-fetch index pages into cache memory:
       (a) Do the nearest neighbor search with T̂_{p+1} in feature space.
       (b) Determine the threshold as step 2.2 in Figure 3.
       (c) Get candidates as step 3 in Figure 2.
       (d) Insert the index pages accessed in steps 2.2(a) and 2.2(c) into the cache.
   2.3 Pre-fetch each candidate into cache memory:
       - if the cache memory is not full, insert the candidate into the cache;
       - otherwise, insert it into a sequential file.
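The overflow handling of step 2.3 above can be sketched as a two-tier candidate store: an in-memory cache that spills to an append-only sequential file, which is read back in one linear pass at the start of the next evaluation. The file format and names are our own simplification:

```python
import os
import pickle
import tempfile

class CandidateStore:
    """Fixed-size in-memory cache plus a sequential overflow file for
    pre-fetched candidates that do not fit (our own simplification)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}
        self.overflow_path = os.path.join(tempfile.mkdtemp(), "overflow.seq")

    def add(self, cand_id, series):
        if len(self.cache) < self.capacity:
            self.cache[cand_id] = series
        else:
            with open(self.overflow_path, "ab") as f:   # sequential append
                pickle.dump((cand_id, series), f)

    def scan_overflow(self):
        # One linear pass over the overflow file, no random I/O.
        if not os.path.exists(self.overflow_path):
            return
        with open(self.overflow_path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break

store = CandidateStore(capacity=2)
for i in range(4):
    store.add(i, [float(i)] * 4)
print(sorted(store.cache))                        # -> [0, 1] stay in memory
print([cid for cid, _ in store.scan_overflow()])  # -> [2, 3] spilled to disk
```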
Figure 4: The Pre-fetch algorithm.

When the threshold is needed for the range query step of the algorithm (as in step 4 of Figure 2 or step 2.4 of Figure 3), three distances are compared, namely: (1) the least distance from the above calculations, (2) the distance from the querying series to the nearest neighbor at the previous position, and (3) the distance from the querying series to the nearest neighbor found in the feature space. The least of the three is then used as the threshold to issue the range query.

Compared with random accesses, sequential files save I/O cost. Therefore, we can expect that this strategy helps reduce the response time when the cache memory is small. We do not use sequential files to store overflowed index pages from the dry run, since access to these index pages in the next evaluation is likely to be in a random pattern, and the advantage of the sequential file is not obvious. Thus, in the cache memory, we reserve one part for index pages and the remainder for the candidates. Combining all the above strategies, the Pre-fetch algorithm is shown in Figure 4.

Algorithm      Description
Scan           Cache candidates and all feature points. If the cache is not large enough, store the un-cached feature points in a sequential file.
Direct-Index   Cache the pages and candidates of the previous evaluations.
Pre-fetch      Predict the next querying series, evaluate with this predicted series, and cache its index pages and candidates. If the cache is not large enough, store the un-cached candidates in a sequential file.

Table 2: Summary of CNNQ algorithms.

For comparison purposes, we also use a sequential scan algorithm to deal with the continuous queries, named the Scan algorithm. All three algorithms for evaluating the nearest neighbor queries of streaming time series are summarized in Table 2.

4.
PERFORMANCE EVALUATION

Our goal in this section is to evaluate the performance of the proposed Pre-fetch algorithm, as compared with the Direct-Index and Scan algorithms. Two types of data are used in the experiments: one real data set and two synthetic data sets. The real data set consists of American stock daily prices, with 7,600 original price series. We use 7,500 of them to generate 100,000 pattern series (by picking their subseries) with lengths of 128 and 256, respectively. Each value of a series is stored as a float number, and thus the two pattern data sets are about 50MB and 100MB, respectively. We randomly pick one of the remaining 100 stock series as the streaming time series (queries), with a length of 1024 values. Although stock streaming data come in very slowly (daily), we want to see the behavior of the algorithms on real-world data and thus assume these data arrive fast in the experiment.

The two synthetic data sets, each consisting of 1,000,000 pattern series, are generated with a random walk function, v[k] = v[k-1] + 10 * RAND[k], where v[k] is the k-th value of the time series and RAND[k] is a random number uniformly distributed on the interval (-0.5, 0.5). The pattern lengths in the two data sets are 128 and 256, respectively. Hence the corresponding sizes of the data sets are 500MB and 1GB. The streaming time series, of length 1024, is independently generated by the same random walk function.

We use the DWT to perform the dimensionality reduction and an R-tree to index the transformed data points. We choose the index dimensionality from 4, 6, 8, or 10, and the page size from 1024, 2048, or 4096. The experiments are performed with all combinations of these dimensionalities and page sizes, but we only show the results with index dimensionality 6 and page size 4096. Since the response time is dominated by the I/O operations for the CNNQ, we use the page faults as the measure of response time.
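Assuming the partly garbled formula above reads v[k] = v[k-1] + 10 * RAND[k] with RAND uniform on (-0.5, 0.5) (our reading of the transcription), the synthetic data generation can be sketched as follows, with sizes scaled down from the paper's 1,000,000 series:

```python
import random

def random_walk(length, seed=None):
    # v[k] = v[k-1] + 10 * RAND[k], RAND[k] uniform on (-0.5, 0.5)
    rng = random.Random(seed)
    v = [0.0]
    for _ in range(length - 1):
        v.append(v[-1] + 10 * (rng.random() - 0.5))
    return v

# Scaled-down stand-ins for the paper's data sets (sizes are ours).
patterns = [random_walk(128, seed=i) for i in range(1000)]  # pattern database
stream = random_walk(1024, seed=9999)                       # streaming series
print(len(patterns), len(patterns[0]), len(stream))         # -> 1000 128 1024
```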
In the experiments, we trace the page faults generated during each step of the algorithm. The same page accessed in one evaluation is only counted once, although it

may be accessed several times. One page fault is regarded as one random page access, while the page faults generated by a sequential scan are normalized [3] by dividing the number of accessed pages by 10. The experiments are coded in C++ and are performed on a dedicated desktop computer. Since we do not record the wall clock time, and only trace the pages that are accessed, the system environment does not make much difference to the results of the experiments.

4.1 Advantage over traditional algorithms

Before we move on to the main task of this section, we report the experiments that show how the use of continuity helps the performance of our algorithms. As mentioned in Section 3, all our algorithms for CNNQ use the result of the previous evaluation to limit the search space of the current evaluation (see step 2.2 of Figure 3, and step 1.2(b.2) of Figure 4). To see the importance of this strategy, we compare the algorithms with and without the corresponding steps. The difference can be seen from the number of leaves of the index that need to be accessed by the different versions.

[Figure 5: The candidate leaves accessed: Stock Data (TRAD = traditional algorithms). Two panels; x-axis: index dimensionality; y-axis: number of candidate leaves (fanout = 0.5); curves for TRAD and CNNQ at page sizes 1024, 2048, and 4096.]

We performed experiments measuring the number of candidate leaves, the pages that contain the candidate pattern time series. The results with stock data are shown in Figure 5. The number of candidate leaves for the CNNQ algorithms is much smaller than for the traditional ones. Note that the graphs in the figure may not directly show the number of page faults for accessing the candidate pattern series.
One leaf of the lower dimensional index may hold more feature points than the number of time series that can be held in one page (of the same size as the leaf). Thus, the corresponding number of page faults will be higher than the number of candidate leaves. Of course, given the data size of the pattern series, we can use Figure 5 to estimate the number of page faults for obtaining the candidate pattern series, especially when all the pattern series indexed by one leaf are clustered on the disk. The same experiments were also performed with the random walk data, which give similar results.

4.2 Performance of CNNQ algorithms

In a large database, the I/O operations dominate the response time. Two types of page faults occur with the CNNQ algorithms. One is the index page faults, which occur when searching the index. The other is the verification page faults, which occur when fetching the candidate pattern series to perform the verification.

(Footnote 3: This reflects the fact, discovered in many experiments, that randomly accessing 10% of the pages is similar in terms of I/O time to sequentially scanning 100% of the pages.)

In order to better analyze how the algorithms perform under different sizes of cache memory, we first compare the index page faults and the verification page faults separately. After that, we give the overall page fault comparison. The overall page faults determine the response time that each algorithm can achieve. In these experiments, we use the LRU strategy to manage the cache.

First we study the index page faults of the three algorithms: Scan, Direct-Index and Pre-fetch. The results for the stock data set are shown in Figure 6(a), with index dimensionality of 6 and pattern lengths of 128 and 256, respectively. In this experiment, we vary the size of the cache memory in which the index pages fetched by one evaluation of an algorithm can be stored.
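The LRU policy used to manage the cache can be illustrated with a minimal sketch (the class and the access trace below are illustrative; the actual experiments are implemented in C++):

```python
from collections import OrderedDict

# Minimal LRU page cache for replaying page-access traces and counting faults.
class LRUCache:
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()   # page id -> None, kept in recency order

    def access(self, page_id):
        """Touch one page; return True if the access causes a page fault."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # hit: refresh recency
            return False
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        self.pages[page_id] = None
        return True

# Replaying the same evaluation twice: once the cache is large enough to hold
# the whole working set, the second pass generates no faults.
cache = LRUCache(capacity_pages=3)
trace = [1, 2, 3, 1, 2, 3]
faults = sum(cache.access(p) for p in trace)
print(faults)  # 3 faults on the first pass, none on the second
```

This is the behavior measured in the experiments: after the initial (cold) stage, the steady-state fault count depends on how much of the per-evaluation working set fits in the cache.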
We issue the nearest neighbor search of the streaming series with the Direct-Index and Pre-fetch algorithms. Since we are interested in the long-term index page fault measure, the initial stage of the query is ignored, because too few index pages have been cached at that point. After the initial stage, we record the index page faults at each position and take the average as the measure. Note that the Scan algorithm does not use the index at all, and its index page faults are determined by the number of patterns and the index dimensionality. We therefore calculate the index page faults of the Scan algorithm for each data set rather than measuring them experimentally.

From Figure 6(a), we can see that the index page faults decrease to zero with Pre-fetch when the cache memory size increases to about 750KB and 700KB, for lengths of 128 and 256 respectively. The performance of the Scan algorithm is linear in the cache memory size, and its index page faults decrease to zero once the cache memory size exceeds 2.4MB, since all feature points are then stored in the cache. Note that when the cache size is small, the overhead of the Scan algorithm is so large that its page faults cannot be shown in Figure 6(a). The index page faults, for both the Direct-Index and Pre-fetch algorithms, drop quickly when the cache size is near 50KB, which holds about 12 index pages. When the cache memory cannot even store the index pages for one evaluation, the I/O overhead is huge. Once the cache memory grows large enough, that is, greater than 700KB, both the Direct-Index and Pre-fetch algorithms work very well. But we can see that Pre-fetch achieves near zero page faults, while Direct-Index still incurs about one page fault even as the cache becomes bigger. Although pre-fetching the index pages helps to reduce the page faults, the difference between Pre-fetch and Direct-Index is not very obvious, especially for the stock data.
The reason is two-fold: first, the querying series is very similar to the previous one, and their corresponding feature points are even closer. Second, each index page contains many feature points, so an index page accessed by the Direct-Index algorithm also has a high probability of being visited again in the next evaluation.

A similar study is carried out for the page faults generated by the verification step (step 2.4 in Figure 3, and steps 1.2(b.2) and 1.2(e) in Figure 4). In these experiments, we assume that the pattern series are stored in the database randomly, so each candidate fetched from disk results in one page fault. If the candidate is already in memory, there is no page fault. For the Pre-fetch algorithm, all cached candidates are accessed first (step 1.2(b.2) in Figure 4). These candidates may be accessed again in the subsequent steps, but no page fault is generated.
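The verification page-fault accounting just described can be sketched as follows (the names are illustrative): candidates already resident in memory, such as those pre-fetched during the previous evaluation, cost nothing, while each candidate fetched from its random disk location costs one page fault.

```python
# Sketch of verification page-fault charging. `candidates` are the ids
# produced by the index search for the current evaluation; `cached` are the
# ids pre-fetched (and still resident) from the previous evaluation. Only
# candidates outside the cache are charged one fault each.

def verification_faults(candidates, cached):
    return sum(1 for c in candidates if c not in cached)

cached = {3, 5, 8, 13}                                 # pre-fetched last time
print(verification_faults([3, 5, 8, 21, 34], cached))  # 2 faults: 21 and 34
```

Under high continuity most candidates repeat between consecutive evaluations, which is why pre-fetching shrinks this count so sharply.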

[Figure 6: The page faults, Stock Data: (a) index page faults, (b) verification page faults, and (c) overall page faults, under varying cache sizes.]

The verification page faults with the stock data set are shown in Figure 6(b). From these graphs, we can observe that the Pre-fetch algorithm outperforms the Direct-Index algorithm even when the cache memory is small. For example, in Figure 6(b) with pattern length L = 128 and no cache memory, Pre-fetch generates only 18 verification page faults while Direct-Index has more than 80. When the cache size is 500KB, Pre-fetch still outperforms Direct-Index by about 10 page faults. Note that the difference between the two algorithms is relatively smaller when the pattern length is 256, since the continuity is higher when the pattern length is larger. This property is addressed in Section 2. We do not include the Scan algorithm in this experiment, since its verification page faults are the same as those of the Direct-Index algorithm.

We are now ready to report the overall page faults for each algorithm. Since the Scan algorithm has too much overhead in searching the index, while its verification page faults are the same as those of the Direct-Index algorithm, we compare only the other two algorithms, Pre-fetch and Direct-Index. The overall page fault curves are obtained with the optimal assignment of the cache memory into two parts, one for index pages and the other for candidates. Given a cache memory that can hold either index pages or candidates, the optimal assignment is the one that minimizes the sum of the index page faults and the verification page faults, over all ways of devoting part of the cache to the index and the remainder to the candidates. The overall page faults with the stock data are shown in Figure 6(c).
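The optimal cache assignment described above reduces to a simple minimization over all splits of the cache budget. A sketch (the fault curves below are illustrative stand-ins for the measured ones):

```python
# Find the split of a cache budget between index pages and candidates that
# minimizes total page faults. `index_faults(c)` / `verify_faults(c)` give the
# measured faults when c cache pages are devoted to that part.

def optimal_split(total_pages, index_faults, verify_faults):
    best = min(range(total_pages + 1),
               key=lambda c: index_faults(c) + verify_faults(total_pages - c))
    return best, index_faults(best) + verify_faults(total_pages - best)

# Toy curves: index faults fall off faster than verification faults.
idx = lambda c: max(0, 40 - 4 * c)
ver = lambda c: max(0, 80 - 2 * c)
print(optimal_split(20, idx, ver))  # (10, 60): 10 pages to the index
```

In the experiments the two fault curves are measured rather than analytic, but the minimization over splits is the same.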
From the graphs, we can see that even without any cache, the Pre-fetch algorithm generates only about half the page faults of the Direct-Index algorithm. (By "without any cache" we mean that all the contents in the working memory need to be fetched from the disk, generating page faults, either through the use of the sequential file or from the original locations.) When the memory size reaches 0.5MB, with pattern lengths of 128 and 256 respectively, the differences are 35 and 43. When the memory size is large enough, both Pre-fetch and Direct-Index achieve similar performance with nearly no page faults. When comparing the page faults in the graph, note that the gap between the curves is much bigger than it may appear: the two bands seem close to each other, but there is a big difference in the vertical values at the same cache size.

[Figure 7: The overall page faults, Random Walk Data.]

The experiments with the random walk data yield similar results to those with the stock data. The overall page faults with the random walk data are illustrated in Figure 7. Clearly, given the same cache memory, the Pre-fetch algorithm greatly outperforms the Direct-Index algorithm. From another point of view, namely how large a cache is needed to reach near zero page faults, the Pre-fetch algorithm reaches this goal with less than 10MB or 15MB, for pattern lengths of 128 and 256 respectively, while the Direct-Index algorithm needs about 13MB or 20MB to obtain similar performance.

5. RELATED WORK

The topic of this paper falls into the general category of continuous queries, which have long been considered useful. In 1992, Terry et al. [24] introduced the notion of continuous queries. They proposed an incremental approach to evaluate the queries on append-only databases. Similarly, Parker et al. (e.g., [19]) considered queries on data streams. Continuous queries have also appeared in Liu et al. [16] as continual queries, where more general scenarios are considered. Recently, due to growing needs from new applications, continuous queries have become an increasingly important research subject. Chen et al. [7, 6] designed the NiagaraCQ system, in which the incremental query evaluation method is no longer restricted to append-only data sources. Babu and Widom [2], and Madden and Franklin [17], also reported system architectures and related issues for continuous queries. Gehrke et al. [11] proposed a single-pass technique for computing correlated aggregates over data streams.

In this paper, we study a special type of continuous query, namely nearest neighbor queries over streaming time series. In our prior research [9], we considered similar continuous nearest and near neighbor queries under the assumption that the number of pattern series is small enough that all the pattern series fit in memory. We proposed a prediction-based batch processing method which yields superior algorithms compared with a direct in-memory scan method. When the number of pattern series is large, the in-memory strategies do not work well. In this paper, we proposed the CNNQ algorithms to deal with this situation. In our latest work [10], given different stream rates and other constraints, we implemented the Direct-Index and Pre-fetch algorithms and compared their performance in terms of drop ratio and response time. Those experiments also showed that the Pre-fetch algorithm outperforms the Direct-Index method.

Another related research area is near and nearest neighbor queries over time series. The general approaches, see [1, 8, 20, 5], map the time series into the frequency domain, and then index the significant part of the coefficients using a high-dimensional indexing structure. (A recent survey of high-dimensional indexes can be found in [3].) However, that research has concentrated on querying individual time series. In a sense, this paper can be viewed as an extension of the indexing work to the continuous query scenario.

Pre-fetching is a well-known approach for improving the performance of file systems. Much work has been done in this field, e.g., [25, 13, 23], and a recent survey can be found in [18]. However, as far as we know, this paper is the first to use pre-fetching for continuous queries.

6. CONCLUSION

In this paper, we studied the continuous nearest neighbor query for a streaming time series. We extended the traditional index-based algorithms to take advantage of the continuity and predictability of the streaming series. With continuity alone, the traditional algorithms can be improved immediately. We went a step further and used caching and prediction to prepare for the next evaluation so as to reduce the response time. We reported experimental results that showed the effectiveness of our algorithms.
The contribution of the paper is in its use of the properties of the continuous nearest neighbor query to achieve better performance. The techniques used, namely caching and pre-fetching, should be generally applicable to many different types of continuous queries. Especially interesting is the use of pre-fetching with the sequential file technique. This technique is most useful in situations where the cache memory is small but the response time requirement is stringent.

Intuitively, there is a trade-off between the response time and how much time we use to prepare for the actual evaluation. Generally, the more time we have to prepare, the faster the response time will be. However, in some applications, we do not always have the luxury of being well prepared for the next evaluation. It would be interesting to study a system that can adapt itself in this regard, that is, one that uses as much time as the situation allows to prepare for the next evaluation, but can stop at any moment to respond to the actual evaluation request. Another interesting research direction is to study the impact of continuity on query performance, that is, how the response times of these algorithms change when the characteristics of the query streaming time series vary over time.

In this paper, we used the R-tree [12] for our basic implementation of the nearest neighbor search algorithm. However, our strategies are not restricted to the R-tree: they can be implemented on many other indexing structures, such as the X-tree and the M-tree. We chose the LRU strategy to manage the cache, but other strategies can also work with the CNNQ algorithms and may yield better performance. Although we only dealt with the nearest neighbor search in this paper, it is easy to lift our algorithms to k-nearest neighbor and range queries for streaming time series.

7. REFERENCES

[1] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In FODO, 1993.
[2] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, Sept. 2001.
[3] S. Berchtold and D. A. Keim. High-dimensional index structures, database support for next decade's applications (tutorial). In ACM SIGMOD, 1998.
[4] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In VLDB, pages 28-39, 1996.
[5] K.-P. Chan and A. W.-C. Fu. Efficient time series matching by wavelets. In ICDE, 1999.
[6] J. Chen, D. J. DeWitt, and J. F. Naughton. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. In ICDE, 2002.
[7] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In ACM SIGMOD, pages 379-390, 2000.
[8] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In ACM SIGMOD, pages 419-429, 1994.
[9] L. Gao and X. S. Wang. Continually evaluating similarity-based pattern queries. In SIGMOD, 2002.
[10] L. Gao and X. S. Wang. Improving the performance of continuous queries on fast data streams: Time series case. In DMKD, 2002.
[11] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD, 2001.
[12] A. Guttman. R-trees: A dynamic index structure for spatial searching. In ACM SIGMOD, pages 47-57, 1984.
[13] H. Seok Jeon and Sam H. Noh. A database disk buffer management algorithm based on prefetching. In CIKM, 1998.
[14] E. J. Keogh, K. Chakrabarti, S. Mehrotra, and M. J. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, 2001.
[15] F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. Fast nearest neighbor search in medical image databases. In VLDB, pages 215-226, 1996.
[16] L. Liu, C. Pu, and W. Tang. Continual queries for Internet scale event-driven information delivery. IEEE TKDE, 11(4):610-628, 1999.
[17] S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In ICDE, 2002.
[18] N. Oren. A survey of prefetching techniques.
[19] D. S. Parker, R. R. Muntz, and H. Lewis Chau. The Tangram stream query processing system. In ICDE, 1989.
[20] D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. In ACM SIGMOD, pages 13-25, 1997.
[21] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, pages 71-79, 1995.
[22] T. Seidl and H.-P. Kriegel. Optimal multi-step k-nearest neighbor search. In SIGMOD, pages 154-165, 1998.
[23] A. Sinha and C. M. Chase. Prefetching and caching for query scheduling in a special class of distributed applications. In ICPP, Vol. 3, pages 95-102.
[24] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In ACM SIGMOD, pages 321-330, 1992.
[25] J. S. Vitter and P. Krishnan. Optimal prefetching via data compression. Journal of the ACM, 43(5):771-793, 1996.



More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

A reversible data hiding based on adaptive prediction technique and histogram shifting

A reversible data hiding based on adaptive prediction technique and histogram shifting A reversible data hiding based on adaptive prediction technique and histogram shifting Rui Liu, Rongrong Ni, Yao Zhao Institute of Information Science Beijing Jiaotong University E-mail: rrni@bjtu.edu.cn

More information

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Rada Chirkova Department of Computer Science, North Carolina State University Raleigh, NC 27695-7535 chirkova@csc.ncsu.edu Foto Afrati

More information

Energy Conservation of Sensor Nodes using LMS based Prediction Model

Energy Conservation of Sensor Nodes using LMS based Prediction Model Energy Conservation of Sensor odes using LMS based Prediction Model Anagha Rajput 1, Vinoth Babu 2 1, 2 VIT University, Tamilnadu Abstract: Energy conservation is one of the most concentrated research

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

SeqIndex: Indexing Sequences by Sequential Pattern Analysis

SeqIndex: Indexing Sequences by Sequential Pattern Analysis SeqIndex: Indexing Sequences by Sequential Pattern Analysis Hong Cheng Xifeng Yan Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign {hcheng3, xyan, hanj}@cs.uiuc.edu

More information

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision report University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision Web Server master database User Interface Images + labels image feature algorithm Extract

More information

Module 6 NP-Complete Problems and Heuristics

Module 6 NP-Complete Problems and Heuristics Module 6 NP-Complete Problems and Heuristics Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu P, NP-Problems Class

More information

An index structure for efficient reverse nearest neighbor queries

An index structure for efficient reverse nearest neighbor queries An index structure for efficient reverse nearest neighbor queries Congjun Yang Division of Computer Science, Department of Mathematical Sciences The University of Memphis, Memphis, TN 38152, USA yangc@msci.memphis.edu

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.

More information

White paper ETERNUS Extreme Cache Performance and Use

White paper ETERNUS Extreme Cache Performance and Use White paper ETERNUS Extreme Cache Performance and Use The Extreme Cache feature provides the ETERNUS DX500 S3 and DX600 S3 Storage Arrays with an effective flash based performance accelerator for regions

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Chapter 3 - Memory Management

Chapter 3 - Memory Management Chapter 3 - Memory Management Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Memory Management 1 / 222 1 A Memory Abstraction: Address Spaces The Notion of an Address Space Swapping

More information

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi

More information

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods S.Anusuya 1, M.Balaganesh 2 P.G. Student, Department of Computer Science and Engineering, Sembodai Rukmani Varatharajan Engineering

More information

File Size Distribution on UNIX Systems Then and Now

File Size Distribution on UNIX Systems Then and Now File Size Distribution on UNIX Systems Then and Now Andrew S. Tanenbaum, Jorrit N. Herder*, Herbert Bos Dept. of Computer Science Vrije Universiteit Amsterdam, The Netherlands {ast@cs.vu.nl, jnherder@cs.vu.nl,

More information

Adaptive Middleware for Distributed Sensor Environments

Adaptive Middleware for Distributed Sensor Environments Adaptive Middleware for Distributed Sensor Environments Xingbo Yu, Koushik Niyogi, Sharad Mehrotra, Nalini Venkatasubramanian University of California, Irvine {xyu, kniyogi, sharad, nalini}@ics.uci.edu

More information

Record Placement Based on Data Skew Using Solid State Drives

Record Placement Based on Data Skew Using Solid State Drives Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories, NEC j-suzuki@ax.jp.nec.com

More information

Design and Implementation of A P2P Cooperative Proxy Cache System

Design and Implementation of A P2P Cooperative Proxy Cache System Design and Implementation of A PP Cooperative Proxy Cache System James Z. Wang Vipul Bhulawala Department of Computer Science Clemson University, Box 40974 Clemson, SC 94-0974, USA +1-84--778 {jzwang,

More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems Query Processing: A Systems View CPS 216 Advanced Database Systems Announcements (March 1) 2 Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

COOCHING: Cooperative Prefetching Strategy for P2P Video-on-Demand System

COOCHING: Cooperative Prefetching Strategy for P2P Video-on-Demand System COOCHING: Cooperative Prefetching Strategy for P2P Video-on-Demand System Ubaid Abbasi and Toufik Ahmed CNRS abri ab. University of Bordeaux 1 351 Cours de la ibération, Talence Cedex 33405 France {abbasi,

More information

Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data

Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data Evangelos Dellis, Bernhard Seeger, and Akrivi Vlachou Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Straße,

More information

Comp Online Algorithms

Comp Online Algorithms Comp 7720 - Online Algorithms Notes 4: Bin Packing Shahin Kamalli University of Manitoba - Fall 208 December, 208 Introduction Bin packing is one of the fundamental problems in theory of computer science.

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution Announcements (March 1) 2 Query Processing: A Systems View CPS 216 Advanced Database Systems Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System Xiaodong Shi Email: shixd.hust@gmail.com Dan Feng Email: dfeng@hust.edu.cn Wuhan National Laboratory for Optoelectronics,

More information

Network Load Balancing Methods: Experimental Comparisons and Improvement

Network Load Balancing Methods: Experimental Comparisons and Improvement Network Load Balancing Methods: Experimental Comparisons and Improvement Abstract Load balancing algorithms play critical roles in systems where the workload has to be distributed across multiple resources,

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

Effective Pattern Similarity Match for Multidimensional Sequence Data Sets

Effective Pattern Similarity Match for Multidimensional Sequence Data Sets Effective Pattern Similarity Match for Multidimensional Sequence Data Sets Seo-Lyong Lee, * and Deo-Hwan Kim 2, ** School of Industrial and Information Engineering, Hanu University of Foreign Studies,

More information

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure Lecture 6: External Interval Tree (Part II) Yufei Tao Division of Web Science and Technology Korea Advanced Institute of Science and Technology taoyf@cse.cuhk.edu.hk 3 Making the external interval tree

More information

Summary Cache based Co-operative Proxies

Summary Cache based Co-operative Proxies Summary Cache based Co-operative Proxies Project No: 1 Group No: 21 Vijay Gabale (07305004) Sagar Bijwe (07305023) 12 th November, 2007 1 Abstract Summary Cache based proxies cooperate behind a bottleneck

More information

An Automatic Hole Filling Method of Point Cloud for 3D Scanning

An Automatic Hole Filling Method of Point Cloud for 3D Scanning An Automatic Hole Filling Method of Point Cloud for 3D Scanning Yuta MURAKI Osaka Institute of Technology Osaka, Japan yuta.muraki@oit.ac.jp Koji NISHIO Osaka Institute of Technology Osaka, Japan koji.a.nishio@oit.ac.jp

More information

An Efficient Execution Scheme for Designated Event-based Stream Processing

An Efficient Execution Scheme for Designated Event-based Stream Processing DEIM Forum 2014 D3-2 An Efficient Execution Scheme for Designated Event-based Stream Processing Yan Wang and Hiroyuki Kitagawa Graduate School of Systems and Information Engineering, University of Tsukuba

More information

Page Replacement. (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018

Page Replacement. (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018 Page Replacement (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018 Today s Goals Making virtual memory virtual : incorporating disk backing. Explore page replacement policies

More information

A Framework for Clustering Massive Text and Categorical Data Streams

A Framework for Clustering Massive Text and Categorical Data Streams A Framework for Clustering Massive Text and Categorical Data Streams Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Philip S. Yu IBM T. J.Watson Research Center psyu@us.ibm.com Abstract

More information

L9: Storage Manager Physical Data Organization

L9: Storage Manager Physical Data Organization L9: Storage Manager Physical Data Organization Disks and files Record and file organization Indexing Tree-based index: B+-tree Hash-based index c.f. Fig 1.3 in [RG] and Fig 2.3 in [EN] Functional Components

More information

Maintenance of the Prelarge Trees for Record Deletion

Maintenance of the Prelarge Trees for Record Deletion 12th WSEAS Int. Conf. on APPLIED MATHEMATICS, Cairo, Egypt, December 29-31, 2007 105 Maintenance of the Prelarge Trees for Record Deletion Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu Department of

More information