Evaluating Continuous Nearest Neighbor Queries for Streaming Time Series via Pre-fetching


Like Gao, Zhengrong Yao, X. Sean Wang
Department of Information and Software Engineering, George Mason University
Mail Stop 4A4, 4400 University Drive, Fairfax, VA
{lgao, zyao,

ABSTRACT

For many applications, it is important to quickly locate the nearest neighbor of a given time series. When the given time series is a streaming one, nearest neighbors may need to be found continuously at all time positions. Such a standing request is called a continuous nearest neighbor query. This paper seeks fast evaluation of continuous queries on large databases. The initial strategy is to use the result of one evaluation to restrict the search space for the next. A more fundamental idea is to extend the existing indexing methods, used in many traditional nearest neighbor algorithms, with pre-fetching. Specifically, pre-fetching is to predict the next value of the stream before it arrives, and to process the query as if the predicted value were the real one, in order to load the needed index pages and time series into the allocated cache memory. Furthermore, if the pre-fetched candidates cannot fit into the cache memory, they are stored in a sequential file to facilitate fast access to them. Experiments show that pre-fetching improves the response time greatly over the direct use of traditional algorithms, even if the caching provided by the operating system is taken into consideration.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems - Query processing

General Terms: Algorithms, Experimentation, Performance

Keywords: Streaming time series, nearest neighbor, continuous query

1. INTRODUCTION

Finding the nearest neighbor of streaming time series can be useful in many applications ranging from sensor monitoring to automated online stock analysis.
In these applications, a large number of time series, called pattern series, are stored in a database, and the input to be monitored or analyzed takes the form of a streaming time series. At each time position, the system must take the current time series, formed by using the most recent values from the stream, and locate the pattern series in the database that is closest to this current series. This standing request throughout all the time positions is called a continuous nearest neighbor query of streaming time series. When the pattern database is large and the stream data come in fast, the challenge is how to evaluate such a continuous query efficiently, especially in terms of response time. For example, in many control systems, it is critical to quickly recognize and detect events from the incoming sensor data, which arrive in the form of time series, and to provide situation awareness for the systems to make smart decisions and react quickly. In general, when the number of pattern series is large and the data must be stored in secondary storage, we can use multi-dimensional index structures to accelerate the search process. Prior research has obtained excellent results in this area [1, 8, 2, 5], providing algorithms that greatly outperform the naive sequential scan method. However, the traditional algorithms may not provide good enough solutions when the volume of patterns is relatively large.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM '02, November 4-9, 2002, McLean, Virginia, USA. Copyright 2002 ACM .../02/11...$5.00.
An example is when the query processing unit is embedded in a small component deployed in the field. In such situations, the use of an index may involve many disk accesses and still require a long processing time. Treating the continuous query at different time positions as independent queries is obviously not the best we can do. In this paper, we study strategies that exploit the characteristics of continuous queries to achieve better performance, especially shorter response time. Our starting point is the traditional time series nearest neighbor algorithms. The basic approach of the traditional algorithms is to use a certain mathematical transform to map the pattern time series of the database into low dimensional approximations, each time series being mapped to a point in a low dimensional feature space. These approximations guarantee that the distance between two feature points is no greater than (and hopefully close to) the distance between their corresponding time series. Suppose the distance (called the threshold) between a querying time series and a particular pattern series is known. Obviously, a pattern series cannot be the nearest neighbor if its distance lower bound to the querying series (obtained from the low dimensional approximation) is greater than this threshold. All other pattern series, however, need to be considered further, and are thus called candidate series. The choice of the approximation and this

particular pattern series is important for the effectiveness of the algorithm. We follow the above traditional strategy and use special properties of the continuous query. Consider two successive evaluations of the continuous nearest neighbor query. If the successive values of the streaming time series mostly do not change abruptly, then the two time series used in these two evaluations (called querying series) are similar to each other. Therefore, in many cases, the nearest neighbor of the first querying series should be close to that of the second querying series. This continuity can be exploited by evaluation algorithms that may use the distance between the second querying series and the nearest neighbor of the first querying series (i.e., the result of the first evaluation). When the continuity is strong, this distance is likely to exclude many pattern series from being considered as candidate nearest neighbors for the second querying series. The continuity property also says that the index pages and the candidates accessed by one evaluation are likely to be accessed again in the next evaluation. This property is no stranger to us. In fact, almost all operating systems and database systems use caching to save disk pages in the cache/buffer memory for future operations. In our continuous queries, this strategy works especially well when the continuity is strong, since the same index pages and candidate time series are likely to be accessed again. The above continuity property can be used directly within the framework of the traditional algorithms. A departure from the traditional algorithms is that we use another important property of the continuous query. This property, called predictability, lies in the fact that in many applications, the values in the streaming time series can be predicted quite well at many, if not most, time positions.
In this case, the next querying time series can be predicted and the next evaluation can be attempted even before the next time value arrives. The usefulness of such a dry run is that we may access the index pages and candidate time series for this predicted querying time series, and cache them. When the actual value comes and the next evaluation must be performed, the index pages and candidates are already in the cache and the response time is shortened dramatically. We call this the pre-fetching strategy. In addition, during the above dry run, we may not be able to fit all the pre-fetched index pages and candidate series into the cache memory. When this happens, we store these candidates back to disk. This time, however, since we know they will very likely be accessed when the actual value arrives, we use a sequential file. The benefit, of course, is that access to this sequential file is much faster than the random access that we would otherwise have to do. This reduces the response time of the continuous query when the available memory for the cache is small. To verify the effectiveness of the above strategies, we perform experiments on some stock market data and a large volume of synthetic random walk data, assuming that caching is always available. We compare the use of continuity versus that without using it, and then the use of pre-fetching against that without it (but both with the continuity property used). All the strategies work well.

Symbol    Comments
S         streaming time series (querying series).
S[i:j]    subseries of streaming S between positions i and j, inclusive.
T_p       querying time series of streaming S at position p, T_p = S[p-L+1 : p].
T̂_p       the predicted querying time series of T_p.

Table 1: Some frequently used symbols.

The remainder of the paper is organized as follows. In Section 2, we present our basic assumptions and definitions. In Section 3, we outline our algorithms, and in Section 4, we present our experimental results. We compare with related
work in Section 5, and then conclude with some discussion.

2. CNNQ FOR STREAMING TIME SERIES

In this section, we first introduce the basic concepts and definitions of continuous nearest neighbor queries (CNNQ), and then discuss some properties of streaming time series and how to use them to improve the performance of CNNQ.

Definition. A streaming time series, denoted S, is an infinite real number sequence whose values are obtained by sampling the underlying process at a fixed time interval. Without loss of generality, we assume the first value of S is sampled at time position 0, and the (i+1)-th value is sampled at time i. If we use S[i] to denote the sampled value at time position i, then S = S[0], ..., S[i], .... A subseries of S from time position i to j, inclusive, is a finite series of j - i + 1 values, and is denoted S[i : j].

The sampled data arrive at the query processing system sequentially. They may come in at the same rate, i.e., at the same speed as the sampling rate, or could arrive at variable speeds. To simplify the illustration, in this paper we assume that the data arrive at the database system at the same speed, and further, that the query processing is fast enough to finish the current evaluation before the next value comes. In our continuous query, given an integer length L, at each time position p, where p >= L - 1, the most recent L values will be used to form a subseries, namely S[p-L+1 : p], and the system is to find the nearest neighbor of S[p-L+1 : p] from a database of time series. Here, we use the Euclidean distance to measure the closeness of two time series x and y of length L: D(x, y) = sqrt(sum_{i=0}^{L-1} (x[i] - y[i])^2).

Definition. Let L > 0 and p >= L - 1 be integers. Let S be a streaming time series. Given a database of N pattern time series O_0, O_1, ..., O_{N-1}, each of which is of length L, pattern O_i is said to be the nearest neighbor of S at position p if for all other O_j, j != i, D(S[p-L+1 : p], O_i) < D(S[p-L+1 : p], O_j).
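As a concrete illustration of these definitions, the following sketch computes the Euclidean distance and the nearest neighbor at one position. The toy stream, toy pattern database, and helper names are ours, not the paper's:

```python
import math

def euclidean(x, y):
    # D(x, y) = sqrt(sum_i (x[i] - y[i])^2), as in the definition above.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbor(querying_series, patterns):
    # Return the index i of the pattern O_i minimizing D(T_p, O_i).
    return min(range(len(patterns)),
               key=lambda i: euclidean(querying_series, patterns[i]))

# Querying series T_p = S[p-L+1 : p] (inclusive in the paper's notation,
# hence the p+1 in the Python slice), with window L = 4 at position p = 5.
S = [1.0, 2.0, 3.0, 2.5, 2.0, 1.5]
L, p = 4, 5
T_p = S[p - L + 1 : p + 1]
patterns = [[3.0, 2.5, 2.0, 1.5], [0.0, 1.0, 2.0, 3.0]]
print(nearest_neighbor(T_p, patterns))  # -> 0: pattern 0 matches T_p exactly
```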
We call the time series S[p-L+1 : p] the querying series at position p and denote it by T_p, when L is understood. Table 1 summarizes the most frequently used symbols.

Definition. Given an integer L > 0, the continuous nearest neighbor query, or CNNQ for short, of a streaming time series S is a standing request that asks for the nearest neighbor of S at each time position p, p >= L - 1, from the database.

In this paper, we use the average response time to measure the performance, and define it as follows.

Definition. Given an integer L > 0 and a streaming time series S, the response time of the continuous nearest neighbor

query, denoted t_r, is the average response time to find one of the nearest neighbors of S:

    t_r = (1/n) * sum_{p=L-1}^{L-2+n} t_{r_p},

where t_{r_p} is the response time to find the nearest neighbor of S at time position p, and n is the number of time positions considered.

When there is a large number of time series patterns in the database, the time cost will mostly involve access to the disk, since our computation is mostly straightforward. We assume the index and all the time series are on the disk, and we measure the number of disk pages accessed in response to a query. As in realistic situations, we further assume that a fixed memory space is reserved for the continuous queries. This fixed memory space is called the cache [1].

To reduce the response time of the CNNQ for streaming time series, it is important to exploit the characteristics of the streaming time series. Compared to queries over a group of unrelated time series, many streaming time series display the properties called predictability and continuity.

Predictability means that in many applications, the next value in the streaming time series can be predicted. We can use this prediction to make preparations before the value comes. This strategy can reduce the response time of queries. Note that this prediction need not be precise every time. Indeed, if there are enough time positions at which the prediction is precise, we will win in overall response time.

[Figure 1: Continuity shown in few changes of nearest neighbor. (Stock data, 10^5 patterns; x-axis: pattern length; y-axis: percentage of positions where the nearest neighbor changes.)]

Continuity says that the stream is relatively smooth, i.e., one value of the time series is not too far away from the previous one. With this continuity, strong correlation may exist between successive querying time series. This similarity between successive querying time series leads to similarity between their nearest neighbors. We performed some experiments to see this continuity in real stock market data.
For each time position p, we look at whether the nearest neighbor at position p is the same as that at the previous position. Figure 1 shows the percentage of positions where a new nearest neighbor appears. Not surprisingly, the percentage goes down as the length L increases, since the greater L is, the less influence each value has on the choice of the nearest neighbor. Also, when the length L is about 100, there is only a small percentage of positions where a new nearest neighbor appears [2]. For example, when we use a pattern length of 128, only at 2% of the time positions does a new nearest neighbor appear. Thus, many times, the answer of the continuous query remains the same.

[1] Another term for this is buffer.
[2] This of course only roughly indicates the continuity. Indeed, even if a new nearest neighbor appears, the new nearest neighbor may be very similar to the previous one. This is not shown in the figure.

3. ALGORITHMS FOR CNNQ

As mentioned in the introduction, our algorithms are based on the traditional indexing methods. In this section, we first review the traditional nearest neighbor search algorithms, and then give a CNNQ algorithm which is a direct extension of the traditional approaches. After that, we present the algorithm that uses both prediction and pre-fetching to achieve faster response time. For comparison purposes, we also introduce the sequential scan method.

3.1 Traditional algorithms

Most of the state-of-the-art nearest neighbor search algorithms are based on dimensionality reduction and indexing techniques. The dimensionality reduction uses some mathematical transform (e.g., SVD, DFT, DWT, APCA) [14] and keeps part of the coefficients from the transformation of a time series. These selected coefficients form the feature of the given time series, and each time series then has a representative in the feature space.
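As a concrete sketch of such a transform-and-truncate feature map, an orthonormal Haar wavelet transform (one of the DWT options mentioned above) can be cut to its first few coefficients; because the full transform preserves Euclidean distances, dropping coefficients can only shrink them. The implementation and toy data are our own illustration:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def haar(series):
    # Orthonormal Haar DWT; the length must be a power of two.
    out = list(series)
    n = len(out)
    detail = []
    while n > 1:
        avg = [(out[2*i] + out[2*i+1]) / math.sqrt(2) for i in range(n // 2)]
        dif = [(out[2*i] - out[2*i+1]) / math.sqrt(2) for i in range(n // 2)]
        detail = dif + detail          # coarse coefficients stay in front
        out = avg
        n //= 2
    return out + detail

def feature(series, k):
    # Keep the first k coefficients as the feature vector.
    return haar(series)[:k]

x = [1.0, 3.0, 5.0, 7.0]
y = [2.0, 2.0, 8.0, 6.0]
# Truncated feature distance never exceeds the true distance.
print(euclidean(feature(x, 2), feature(y, 2)) <= euclidean(x, y))  # -> True
```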
Most of these transforms guarantee that the distance of any two time series is no less than their distance in the feature space. A high-dimensional index is built on the feature points in the feature space using an R-tree, X-tree, KDB-tree, etc. [12, 4]. When a query is issued, it takes several steps to find the nearest neighbor of the querying series [15, 22]. The major steps of such an algorithm are illustrated in Figure 2.

Step  Action
1. (Nearest neighbor search in feature space): transform the querying series T to f(T) in the feature space and issue the nearest neighbor search to find the sub-optimal or optimal nearest neighbor of f(T), denoted NN_f;
2. (Determining the threshold): calculate the real distance from the querying series T to NN_f in the original space, denoted TH_range;
3. (Range query): use TH_range as the range to find all the features whose distances to f(T) are no greater than TH_range; the time series found are called candidates;
4. (Verification): evaluate the actual distance from each candidate to the query T in the original space, and find the actual nearest neighbor.

Figure 2: Traditional multi-step algorithm.

The first step of the multi-step nearest neighbor search algorithm is costly if the optimal NN_f in the feature space is produced, especially when the number of patterns is large and the dimensionality of the feature space is high. A sub-optimal nearest neighbor may be found at reduced cost [21]. Since the costs of Steps 3 and 4 are determined by the threshold TH_range obtained from Step 2, TH_range should be chosen as small as possible. One method to reduce the threshold, called the optimal multi-step algorithm, is given in [22]. The method dynamically reduces the threshold by sorting all the pattern series according to their distances to f(T) in the feature space, and then incrementally fetching the objects.
The threshold is dynamically reduced at each step, if possible, to reduce the candidates without calculating the actual distances.
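The four steps of Figure 2 can be sketched as a simplified in-memory routine. This is our own sketch: there is no index, f stands for any lower-bounding feature map, and here we use a plain coordinate projection, which can only shrink Euclidean distances:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def multi_step_nn(T, patterns, f):
    # Step 1: nearest neighbor search in feature space, giving NN_f.
    fT = f(T)
    nn_f = min(range(len(patterns)),
               key=lambda i: euclidean(fT, f(patterns[i])))
    # Step 2: its real distance becomes the threshold TH_range.
    th_range = euclidean(T, patterns[nn_f])
    # Step 3: range query -- keep patterns whose feature distance (a lower
    # bound on the real distance) does not exceed the threshold.
    candidates = [i for i in range(len(patterns))
                  if euclidean(fT, f(patterns[i])) <= th_range]
    # Step 4: verify the candidates with real distances.
    return min(candidates, key=lambda i: euclidean(T, patterns[i]))

patterns = [[0.0] * 4, [1.0] * 4, [5.0] * 4]
T = [1.0, 1.0, 1.0, 0.5]
proj = lambda s: s[:2]   # dropping coordinates never increases the distance
print(multi_step_nn(T, patterns, proj))  # -> 1 (the pattern [1, 1, 1, 1])
```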

3.2 The Direct-Index algorithm

As mentioned in Section 2, due to continuity, the nearest neighbor at the previous time position is likely to be close to the nearest neighbor at the current position. This is especially true with long patterns. Therefore, the distance from the current querying series to the nearest neighbor at the previous position is likely to be the smallest threshold, and it may be even smaller than the optimal distance found by the above optimal multi-step algorithm. With this observation, we modify the traditional multi-step algorithm by taking the nearest neighbor at the previous position into consideration. Specifically, in step 2 of Figure 2, we calculate the distance from the previous nearest neighbor to the current querying series, and choose the smaller of this distance and TH_range as the new threshold. This value is used in step 3. The experiments in Section 4.1 show that this modification significantly reduces the search space of the range query.

In addition to the above strategy of using continuity, we also use caching to further speed up the evaluations. Again, as mentioned in Section 2, due to the continuity property, we can expect that two successive evaluations share some of the search space and candidate sets. So it is advantageous to keep the accessed index pages and candidates in main memory. This kind of caching is naturally achieved by the underlying operating system. In our case, we can assume that the operating system always allocates a fixed amount of memory to cache these pages and candidates. The cache memory can hold not only the index pages and candidates of one evaluation, but also those of previous ones, as long as they fit in the allocated memory. A page replacement algorithm, e.g., the Least-Recently-Used (LRU) algorithm, can be used to manage the cache memory.
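A fixed-size LRU cache of the kind described above can be sketched in a few lines. This is a stand-in of our own for the operating system's page cache, with hypothetical page identifiers:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()      # page_id -> page data

    def get(self, page_id):
        # A hit refreshes the page's recency; a miss returns None (page fault).
        if page_id not in self.pages:
            return None
        self.pages.move_to_end(page_id)
        return self.pages[page_id]

    def put(self, page_id, data):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used

cache = LRUCache(2)
cache.put("index:1", "...")
cache.put("index:2", "...")
cache.get("index:1")            # touch page 1, so page 2 becomes LRU
cache.put("index:3", "...")     # evicts page 2
print(cache.get("index:2"))     # -> None, i.e., a page fault
```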
Since these resources are cached by the operating system and the query evaluation process is not directly involved in the cache management, this method can be viewed as a passive caching strategy. Combining the two strategies, we summarize this extended algorithm as the Direct-Index algorithm shown in Figure 3.

At each time position p of the streaming S, perform the following steps:
1. Form the querying series T_p = S[p-L+1 : p].
2. Find the nearest neighbor of T_p as follows:
   2.1 Do the nearest neighbor search as step 1 in Figure 2.
   2.2 Determine the threshold as follows:
       (a) Get TH_range as step 2 in Figure 2;
       (b) Find the real distance, d_pre, from T_p to the nearest neighbor at the previous position p-1;
       (c) Update the threshold: TH_range = min(TH_range, d_pre).
   2.3 Get candidates as step 3 in Figure 2.
   2.4 Find the nearest neighbor as step 4 in Figure 2.
3. Report the nearest neighbor.
4. Insert the index pages and candidates of this evaluation into the cache memory.

Figure 3: The Direct-Index algorithm.

Compared with the traditional nearest neighbor algorithms, the Direct-Index algorithm needs less time to find the nearest neighbor of the streaming S at each position. One reason is the reduced search space; the other is the cached index pages and candidate pattern series. As a result, both the processing time and the response time are shorter.

3.3 The Pre-fetch algorithm

The Direct-Index algorithm exploits the continuity of streaming time series to reduce the processing time. We may use the other property, namely predictability, to further reduce the response time, even at time positions where continuity is not strong. This is the basis for the Pre-fetch algorithm. As mentioned in Section 2, at many time positions, the next value of the streaming series can be predicted reasonably well. This provides the opportunity to obtain fast response time by using the idle time before the actual value arrives.
Instead of caching the resources used by the current evaluation, we can load into memory the index pages and candidate series that are more likely to be needed by the next evaluation. In order to do so, we perform the Direct-Index algorithm on the predicted querying time series. All the index pages and candidate series accessed in this dry run are potential targets to be cached and used when the actual value arrives and the real evaluation is performed. Hence, the preparation step serves to load the needed index pages and candidate series. This will be most useful when the continuity of the series is weak, i.e., when two successive querying series are not similar to each other. Indeed, in this case, the index pages and candidate series used by the previous evaluation may not be useful for the next one, while pre-fetching will load the correct (based on prediction) index pages and candidate series. Since this pre-fetching method actively loads useful resources into the cache, it can be viewed as an active caching strategy.

Consider the streaming time series S. At the current time position p, the first L-1 values of the next querying series T_{p+1} have already arrived by our assumption. We only need the next value S[p+1] to form the entire querying series. In most cases, the value S[p+1] can be predicted with real-world models and statistical inference; the prediction of S[p+1] is denoted Ŝ[p+1]. With this prediction, we can form the predicted querying series at the next position: T̂_{p+1} = S[p-L+2], ..., S[p], Ŝ[p+1]. The simplest prediction of S[p+1] is to let Ŝ[p+1] = S[p], i.e., use the previous value as the prediction of the next one. We can easily show the following:

Proposition 1. Assume that the prediction of the streaming value at position p+1 is Ŝ[p+1] = S[p], and the predicted time series of T_{p+1} is T̂_{p+1} = S[p-L+2], ..., S[p], Ŝ[p+1]. Then D(T_{p+1}, T̂_{p+1}) <= D(T_{p+1}, T_p).
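Proposition 1 can be checked numerically on a toy stream of our own: D(T_{p+1}, T̂_{p+1}) reduces to |S[p+1] - S[p]|, which is one of the terms inside D(T_{p+1}, T_p):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

S = [1.0, 1.2, 0.9, 1.5, 1.4, 1.7]       # toy stream
L, p = 3, 4                               # window length, current position

T_p    = S[p - L + 1 : p + 1]             # current querying series
T_next = S[p - L + 2 : p + 2]             # actual next querying series T_{p+1}
T_hat  = S[p - L + 2 : p + 1] + [S[p]]    # prediction with S_hat[p+1] = S[p]

# D(T_{p+1}, T_hat) = |S[p+1] - S[p]|, one term inside D(T_{p+1}, T_p),
# hence the inequality of Proposition 1.
print(euclidean(T_next, T_hat), euclidean(T_next, T_p))
```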
The proof is straightforward from the definition of the Euclidean distance. This result means that even with this simplest prediction, the predicted querying series is closer to the next querying series than the current querying series is. Therefore, it makes more sense to use the predicted querying series to load the index pages and candidate series than to use the previous querying time series. When an application has a precise prediction of the values in the stream, it will be even more beneficial than the simple method given here. Once we have the predicted querying series, we can use it to fetch the index pages and candidates by evaluating a nearest neighbor query and a range query, i.e., step 2 in Figure 3. These pages will stay in the cache memory to be used when the actual value arrives.

For large databases, the pre-fetched index pages and candidates may be too large to fit in the cache memory, while these pages are very likely to be used soon. A solution is to take advantage of sequential files for the overflowed candidate series. When pre-fetched candidates cannot fit in the cache, we store them in a sequential file. Once the next evaluation begins, this file is read in sequentially as the first step of the evaluation, to calculate the distances between the querying series and these candidates.

At each position p of a streaming time series:
1. Find the nearest neighbor of S at position p:
   1.1 Form the querying series T_p = S[p-L+1 : p].
   1.2 Find the nearest neighbor of T_p:
       (a) Do the nearest neighbor search in feature space as step 1 in Figure 2.
       (b) Determine the threshold as follows:
           (1) Get TH_range as step 2.2 in Figure 3;
           (2) Get the smallest distance d_cad from T_p to the candidates cached by the previous evaluations:
               - if the candidate is in the cache memory, use it directly;
               - otherwise, read it from the sequential file.
           (3) Update the threshold: TH_range = min(TH_range, d_cad).
       (c) Get the candidates as step 3 in Figure 2.
       (d) Exclude the candidates that are used in 1.2(b.2), except for the one that yields d_cad.
       (e) Find the nearest neighbor as step 4 in Figure 2.
   1.3 Report the nearest neighbor.
2. Pre-fetch for the next evaluation at position p:
   2.1 Predict the data at position p+1, Ŝ[p+1], and form the predicted querying series T̂_{p+1} = S[p-L+2], ..., S[p], Ŝ[p+1];
   2.2 Pre-fetch index pages into cache memory:
       (a) Do the nearest neighbor search with T̂_{p+1} in feature space.
       (b) Determine the threshold as step 2.2 in Figure 3.
       (c) Get candidates as step 3 in Figure 2.
       (d) Insert the index pages accessed in steps 2.2(a) and 2.2(c) into the cache.
   2.3 Pre-fetch each candidate into cache memory:
       - if the cache memory is not full, insert the candidate into the cache;
       - otherwise, insert it into a sequential file.
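The overflow handling of step 2.3 above can be sketched as a two-tier candidate store: an in-memory cache that spills to an append-only sequential file, which is read back in one linear pass at the start of the next evaluation. The file format and names are our own simplification:

```python
import os
import pickle
import tempfile

class CandidateStore:
    """Fixed-size in-memory cache plus a sequential overflow file for
    pre-fetched candidates that do not fit (our own simplification)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}
        self.overflow_path = os.path.join(tempfile.mkdtemp(), "overflow.seq")

    def add(self, cand_id, series):
        if len(self.cache) < self.capacity:
            self.cache[cand_id] = series
        else:
            with open(self.overflow_path, "ab") as f:   # sequential append
                pickle.dump((cand_id, series), f)

    def scan_overflow(self):
        # One linear pass over the overflow file, no random I/O.
        if not os.path.exists(self.overflow_path):
            return
        with open(self.overflow_path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break

store = CandidateStore(capacity=2)
for i in range(4):
    store.add(i, [float(i)] * 4)
print(sorted(store.cache))                        # -> [0, 1] stay in memory
print([cid for cid, _ in store.scan_overflow()])  # -> [2, 3] spilled to disk
```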
Figure 4: The Pre-fetch algorithm.

When the threshold is needed for the range query step of the algorithm (as in step 4 of Figure 2 or step 2.4 of Figure 3), three distances are compared, namely: (1) the least distance from the above calculations, (2) the distance from the querying series to the nearest neighbor at the previous position, and (3) the distance from the querying series to the nearest neighbor found in the feature space. The least of the three is then used as the threshold to issue the range query.

Compared with random accesses, sequential files save I/O cost. Therefore, we can expect that this strategy helps reduce the response time when the cache memory is small. We do not use sequential files to store overflowed index pages from the dry run, since access to these index pages in the next evaluation is likely to be in a random pattern, and the advantage of the sequential file is not obvious. Thus, in the cache memory, we reserve one part for index pages and the remainder for the candidates. Combining all the above strategies, the Pre-fetch algorithm is shown in Figure 4.

Algorithm      Description
Scan           Cache candidates and all feature points. If the cache is not large enough, store the un-cached feature points in a sequential file.
Direct-Index   Cache the pages and candidates of the previous evaluations.
Pre-fetch      Predict the next querying series, evaluate with this predicted series, and cache its index pages and candidates. If the cache is not large enough, store the un-cached candidates in a sequential file.

Table 2: Summary of CNNQ algorithms.

For comparison purposes, we also use a sequential scan algorithm to deal with the continuous queries, named the Scan algorithm. All three algorithms for evaluating the nearest neighbor queries of streaming time series are summarized in Table 2.

4.
PERFORMANCE EVALUATION

Our goal in this section is to evaluate the performance of the proposed Pre-fetch algorithm, as compared with the Direct-Index and Scan algorithms. Two types of data are used in the experiments: one real data set and two synthetic data sets. The real data set consists of American stock daily prices, with 7,600 original price series. We use 7,500 of them to generate 100,000 pattern series (by picking their subseries) with lengths of 128 and 256, respectively. Each value of a series is stored as a float number, and thus the two pattern data sets are about 50MB and 100MB, respectively. We randomly pick one of the remaining 100 stock series as the streaming time series (queries), with a length of 1024 values. Although stock streaming data come in very slowly (daily), we want to see the behavior of the algorithms on real-world data and thus assume these data arrive fast in the experiment.

The two synthetic data sets, each consisting of 1,000,000 pattern series, are generated with a random walk function, v[k] = v[k-1] + 10 * RAND[k], where v[k] is the k-th value of the time series and RAND[k] is a random number uniformly distributed on the interval (-0.5, 0.5). The pattern lengths in the two data sets are 128 and 256, respectively. Hence the corresponding sizes of the data sets are 500MB and 1GB. The streaming time series, of length 1024, is independently generated by the same random walk function.

We use the DWT to perform the dimensionality reduction and an R-tree to index the transformed data points. We choose the index dimensionality from 4, 6, 8, or 10, and the page size from 1024, 2048, or 4096. The experiments are performed with all combinations of these dimensionalities and page sizes, but we only show the results with index dimensionality 6 and page size 4096. Since the response time is dominated by the I/O operations for the CNNQ, we use the page faults as the measure of response time.
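Assuming the partly garbled formula above reads v[k] = v[k-1] + 10 * RAND[k] with RAND uniform on (-0.5, 0.5) (our reading of the transcription), the synthetic data generation can be sketched as follows, with sizes scaled down from the paper's 1,000,000 series:

```python
import random

def random_walk(length, seed=None):
    # v[k] = v[k-1] + 10 * RAND[k], RAND[k] uniform on (-0.5, 0.5)
    rng = random.Random(seed)
    v = [0.0]
    for _ in range(length - 1):
        v.append(v[-1] + 10 * (rng.random() - 0.5))
    return v

# Scaled-down stand-ins for the paper's data sets (sizes are ours).
patterns = [random_walk(128, seed=i) for i in range(1000)]  # pattern database
stream = random_walk(1024, seed=9999)                       # streaming series
print(len(patterns), len(patterns[0]), len(stream))         # -> 1000 128 1024
```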
In the experiments, we trace the page faults generated during each step of the algorithm. The same page accessed in one evaluation is only counted once, although it

may be accessed several times. One page fault is regarded as one random page access, while the page faults generated by a sequential scan are normalized [3] by dividing the number of accessed pages by 10. The experiments are coded in C++ and are performed on a dedicated desktop computer. Since we do not record the wall clock time, and only trace the pages that are accessed, the system environment does not make much difference to the results of the experiments.

4.1 Advantage over traditional algorithms

Before we move on to the main task of this section, we report the experiments that show how the use of continuity helps the performance of our algorithms. As mentioned in Section 3, all our algorithms for CNNQ use the result of the previous evaluation to limit the search space of the current evaluation (see step 2.2 of Figure 3, and step 1.2(b.2) of Figure 4). To see the importance of this strategy, we compare the algorithms with and without the corresponding steps. The difference can be seen from the number of leaves of the index that need to be accessed by the different versions.

[Figure 5: The candidate leaves accessed: Stock Data (TRAD = traditional algorithms). Two panels; x-axis: index dimensionality; y-axis: number of candidate leaves (fanout = 0.5); curves for TRAD and CNNQ at page sizes 1024, 2048, and 4096.]

We performed experiments measuring the number of candidate leaves, the pages that contain the candidate pattern time series. The results with stock data are shown in Figure 5. The number of candidate leaves for the CNNQ algorithms is much smaller than for the traditional ones. Note that the graphs in the figure may not directly show the number of page faults for accessing the candidate pattern series.
One leaf of the lower dimensional index may hold more feature points than the number of time series that can be held in one page (of the same size as the leaf). Thus, the corresponding number of page faults will be higher than the number of candidate leaves. Of course, given the data size of the pattern series, we can use Figure 5 to estimate the number of page faults for obtaining the candidate pattern series, especially when all the pattern series indexed by one leaf are clustered on the disk. The same experiments were also performed with the random walk data, which give similar results.

4.2 Performance of CNNQ algorithms

In a large database, the I/O operations dominate the response time. Two types of page faults occur with the CNNQ algorithms. One is the index page faults, which occur when searching the index. The other is the verification page faults, which occur when fetching the candidate pattern series to perform the verification.

(Footnote 3: This reflects the fact, discovered in many experiments, that randomly accessing 10% of the pages is similar in terms of I/O time to sequentially scanning 100% of the pages.)

In order to better analyze how the algorithms perform under different sizes of cache memory, we first compare the index page faults and the verification page faults separately. After that, we give the overall page fault comparison. The overall page faults determine the response time that each algorithm can achieve. In these experiments, we use the LRU strategy to manage the cache.

First we study the index page faults of the three algorithms: Scan, Direct-Index and Pre-fetch. The results for the stock data set are shown in Figure 6(a), with index dimensionality of 6 and pattern lengths of 128 and 256, respectively. In this experiment, we vary the size of the cache memory in which the index pages fetched by one evaluation of an algorithm can be stored.
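The LRU policy used to manage the cache can be illustrated with a minimal sketch (the class and the access trace below are illustrative; the actual experiments are implemented in C++):

```python
from collections import OrderedDict

# Minimal LRU page cache for replaying page-access traces and counting faults.
class LRUCache:
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()   # page id -> None, kept in recency order

    def access(self, page_id):
        """Touch one page; return True if the access causes a page fault."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # hit: refresh recency
            return False
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        self.pages[page_id] = None
        return True

# Replaying the same evaluation twice: once the cache is large enough to hold
# the whole working set, the second pass generates no faults.
cache = LRUCache(capacity_pages=3)
trace = [1, 2, 3, 1, 2, 3]
faults = sum(cache.access(p) for p in trace)
print(faults)  # 3 faults on the first pass, none on the second
```

This is the behavior measured in the experiments: after the initial (cold) stage, the steady-state fault count depends on how much of the per-evaluation working set fits in the cache.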
We issue the nearest neighbor search of the streaming series with the Direct-Index and Pre-fetch algorithms. Since we are interested in the long-term index page fault measure, the initial stage of the query is ignored, because too few index pages have been cached at that point. After the initial stage, we record the index page faults at each position and take the average as the measure. Note that the Scan algorithm does not use the index at all, and its index page faults are determined by the number of patterns and the index dimensionality. We therefore calculate the index page faults of the Scan algorithm for each data set rather than measuring them experimentally.

From Figure 6(a), we can see that the index page faults decrease to zero with Pre-fetch when the cache memory size increases to about 750KB and 700KB, for lengths of 128 and 256 respectively. The performance of the Scan algorithm is linear in the cache memory size, and its index page faults decrease to zero once the cache memory size exceeds 2.4MB, since all feature points are then stored in the cache. Note that when the cache size is small, the overhead of the Scan algorithm is so large that its page faults cannot be shown in Figure 6(a). The index page faults, for both the Direct-Index and Pre-fetch algorithms, drop quickly when the cache size is near 50KB, which holds about 12 index pages. When the cache memory cannot even store the index pages for one evaluation, the I/O overhead is huge. Once the cache memory grows large enough, that is, greater than 700KB, both the Direct-Index and Pre-fetch algorithms work very well. But we can see that Pre-fetch achieves near zero page faults, while Direct-Index still incurs about one page fault even as the cache becomes bigger. Although pre-fetching the index pages helps to reduce the page faults, the difference between Pre-fetch and Direct-Index is not very obvious, especially for the stock data.
The reason is two-fold: first, the querying series is very similar to the previous one, and their corresponding feature points are even closer. Second, each index page contains many feature points, so an index page accessed by the Direct-Index algorithm also has a high probability of being visited again in the next evaluation.

A similar study is carried out for the page faults generated by the verification step (step 2.4 in Figure 3, and steps 1.2(b.2) and 1.2(e) in Figure 4). In these experiments, we assume that the pattern series are stored in the database randomly, so each candidate fetched from disk results in one page fault. If the candidate is already in memory, there is no page fault. For the Pre-fetch algorithm, all cached candidates are accessed first (step 1.2(b.2) in Figure 4). These candidates may be accessed again in the subsequent steps, but no page fault is generated.
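The verification page-fault accounting just described can be sketched as follows (the names are illustrative): candidates already resident in memory, such as those pre-fetched during the previous evaluation, cost nothing, while each candidate fetched from its random disk location costs one page fault.

```python
# Sketch of verification page-fault charging. `candidates` are the ids
# produced by the index search for the current evaluation; `cached` are the
# ids pre-fetched (and still resident) from the previous evaluation. Only
# candidates outside the cache are charged one fault each.

def verification_faults(candidates, cached):
    return sum(1 for c in candidates if c not in cached)

cached = {3, 5, 8, 13}                                 # pre-fetched last time
print(verification_faults([3, 5, 8, 21, 34], cached))  # 2 faults: 21 and 34
```

Under high continuity most candidates repeat between consecutive evaluations, which is why pre-fetching shrinks this count so sharply.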

[Figure 6: The page faults, Stock Data: (a) index page faults, (b) verification page faults, and (c) overall page faults, under varying cache sizes.]

The verification page faults with the stock data set are shown in Figure 6(b). From these graphs, we can observe that the Pre-fetch algorithm outperforms the Direct-Index algorithm even when the cache memory is small. For example, in Figure 6(b) with pattern length L = 128 and no cache memory, Pre-fetch generates only 18 verification page faults while Direct-Index has more than 80. When the cache size is 500KB, Pre-fetch still outperforms Direct-Index by about 10 page faults. Note that the difference between the two algorithms is relatively smaller when the pattern length is 256, since the continuity is higher when the pattern length is larger. This property is addressed in Section 2. We do not include the Scan algorithm in this experiment, since its verification page faults are the same as those of the Direct-Index algorithm.

We are now ready to report the overall page faults for each algorithm. Since the Scan algorithm has too much overhead in searching the index, while its verification page faults are the same as those of the Direct-Index algorithm, we compare only the other two algorithms, Pre-fetch and Direct-Index. The overall page fault curves are obtained with the optimal assignment of the cache memory into two parts, one for index pages and the other for candidates. Given a cache memory that can hold either index pages or candidates, the optimal assignment is the one that minimizes the sum of the index page faults and the verification page faults, over all ways of devoting part of the cache to the index and the remainder to the candidates. The overall page faults with the stock data are shown in Figure 6(c).
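The optimal cache assignment described above reduces to a simple minimization over all splits of the cache budget. A sketch (the fault curves below are illustrative stand-ins for the measured ones):

```python
# Find the split of a cache budget between index pages and candidates that
# minimizes total page faults. `index_faults(c)` / `verify_faults(c)` give the
# measured faults when c cache pages are devoted to that part.

def optimal_split(total_pages, index_faults, verify_faults):
    best = min(range(total_pages + 1),
               key=lambda c: index_faults(c) + verify_faults(total_pages - c))
    return best, index_faults(best) + verify_faults(total_pages - best)

# Toy curves: index faults fall off faster than verification faults.
idx = lambda c: max(0, 40 - 4 * c)
ver = lambda c: max(0, 80 - 2 * c)
print(optimal_split(20, idx, ver))  # (10, 60): 10 pages to the index
```

In the experiments the two fault curves are measured rather than analytic, but the minimization over splits is the same.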
From the graphs, we can see that even without any cache, the Pre-fetch algorithm generates only about half the page faults of the Direct-Index algorithm. (By "without any cache" we mean that all the contents in the working memory need to be fetched from the disk, generating page faults, either through the use of the sequential file or from the original locations.) When the memory size reaches 0.5MB, with pattern lengths of 128 and 256 respectively, the differences are 35 and 43. When the memory size is large enough, both Pre-fetch and Direct-Index achieve similar performance with nearly no page faults. When comparing the page faults in the graph, note that the gap between the curves is much bigger than it may appear: the two bands seem close to each other, but there is a big difference in the vertical values at the same cache size.

[Figure 7: The overall page faults, Random Walk Data.]

The experiments with the random walk data yield similar results to those with the stock data. The overall page faults with the random walk data are illustrated in Figure 7. Clearly, given the same cache memory, the Pre-fetch algorithm greatly outperforms the Direct-Index algorithm. From another point of view, namely how large a cache is needed to reach near zero page faults, the Pre-fetch algorithm reaches this goal with less than 10MB or 15MB, for pattern lengths of 128 and 256 respectively, while the Direct-Index algorithm needs about 13MB or 20MB to obtain similar performance.

5. RELATED WORK

The topic of this paper falls into the general category of continuous queries, which have long been considered useful. In 1992, Terry et al. [24] introduced the notion of continuous queries. They proposed an incremental approach to evaluate the queries on append-only databases. Similarly, Parker et al. (e.g., [19]) considered queries on data streams. Continuous queries have also appeared in Liu et al. [16] as continual queries, where more general scenarios are considered. Recently, due to growing needs from new applications, continuous queries have become an increasingly important research subject. Chen et al. [7, 6] designed the NiagaraCQ system, in which the incremental query evaluation method is no longer restricted to append-only data sources. Babu and Widom [2], and Madden and Franklin [17], also reported system architectures and related issues for continuous queries. Gehrke et al. [11] proposed a single-pass technique for computing correlated aggregates over data streams.

In this paper, we study a special type of continuous query, namely nearest neighbor queries over streaming time series. In our prior research [9], we considered similar continuous nearest and near neighbor queries under the assumption that the number of pattern series is small enough that all the pattern series fit in memory. We proposed a prediction-based batch processing method which yields superior algorithms compared with a direct in-memory scan method. When the number of pattern series is large, the in-memory strategies do not work well. In this paper, we proposed the CNNQ algorithms to deal with this situation. In our latest work [10], given different stream rates and other constraints, we implemented the Direct-Index and Pre-fetch algorithms and compared their performance in terms of drop ratio and response time. Those experiments also showed that the Pre-fetch algorithm outperforms the Direct-Index method.

Another related research area is near and nearest neighbor queries over time series. The general approaches, see [1, 8, 20, 5], map the time series into the frequency domain, and then index the significant part of the coefficients using a high-dimensional indexing structure. (A recent survey of high-dimensional indexes can be found in [3].) However, that research has concentrated on querying individual time series. In a sense, this paper can be viewed as an extension of the indexing work to the continuous query scenario.

Pre-fetching is a well-known approach for improving the performance of file systems. Much work has been done in this field, e.g., [25, 13, 23], and a recent survey can be found in [18]. However, as far as we know, this paper is the first to use pre-fetching for continuous queries.

6. CONCLUSION

In this paper, we studied the continuous nearest neighbor query for a streaming time series. We extended the traditional index-based algorithms to take advantage of the continuity and predictability of the streaming series. With continuity alone, the traditional algorithms can be improved immediately. We went a step further and used caching and prediction to prepare for the next evaluation so as to reduce the response time. We reported experimental results that showed the effectiveness of our algorithms.
The contribution of the paper is in its use of the properties of the continuous nearest neighbor query to achieve better performance. The techniques used, namely caching and pre-fetching, should be generally applicable to many different types of continuous queries. Especially interesting is the use of pre-fetching with the sequential file technique. This technique is most useful in situations where the cache memory is small but the response time requirement is stringent.

Intuitively, there is a trade-off between the response time and how much time we use to prepare for the actual evaluation. Generally, the more time we have to prepare, the faster the response time will be. However, in some applications, we do not always have the luxury of being well prepared for the next evaluation. It would be interesting to study a system that can adapt itself in this regard, that is, one that uses as much time as the situation allows to prepare for the next evaluation, but can stop at any moment to respond to the actual evaluation request. Another interesting research direction is to study the impact of continuity on query performance, that is, how the response times of these algorithms change when the characteristics of the query streaming time series vary over time.

In this paper, we used the R-tree [12] for our basic implementation of the nearest neighbor search algorithm. However, our strategies are not restricted to the R-tree: they can be implemented on many other indexing structures, such as the X-tree and the M-tree. We chose the LRU strategy to manage the cache, but other strategies can also work with the CNNQ algorithms and may yield better performance. Although we only dealt with the nearest neighbor search in this paper, it is easy to lift our algorithms to k-nearest neighbor and range queries for streaming time series.

7. REFERENCES

[1] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In FODO, 1993.
[2] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, Sept. 2001.
[3] S. Berchtold and D. A. Keim. High-dimensional index structures, database support for next decade's applications (tutorial). In ACM SIGMOD, 1998.
[4] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In VLDB, pages 28-39, 1996.
[5] K.-P. Chan and A. W.-C. Fu. Efficient time series matching by wavelets. In ICDE, 1999.
[6] J. Chen, D. J. DeWitt, and J. F. Naughton. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. In ICDE, 2002.
[7] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In ACM SIGMOD, pages 379-390, 2000.
[8] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In ACM SIGMOD, pages 419-429, 1994.
[9] L. Gao and X. S. Wang. Continually evaluating similarity-based pattern queries. In SIGMOD, 2002.
[10] L. Gao and X. S. Wang. Improving the performance of continuous queries on fast data streams: Time series case. In DMKD, 2002.
[11] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD, 2001.
[12] A. Guttman. R-trees: A dynamic index structure for spatial searching. In ACM SIGMOD, pages 47-57, 1984.
[13] H. Seok Jeon and Sam H. Noh. A database disk buffer management algorithm based on prefetching. In CIKM, 1998.
[14] E. J. Keogh, K. Chakrabarti, S. Mehrotra, and M. J. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, 2001.
[15] F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. Fast nearest neighbor search in medical image databases. In VLDB, pages 215-226, 1996.
[16] L. Liu, C. Pu, and W. Tang. Continual queries for Internet scale event-driven information delivery. IEEE TKDE, 11(4):610-628, 1999.
[17] S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In ICDE, 2002.
[18] N. Oren. A survey of prefetching techniques.
[19] D. S. Parker, R. R. Muntz, and H. Lewis Chau. The Tangram stream query processing system. In ICDE, 1989.
[20] D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. In ACM SIGMOD, pages 13-25, 1997.
[21] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, pages 71-79, 1995.
[22] T. Seidl and H.-P. Kriegel. Optimal multi-step k-nearest neighbor search. In SIGMOD, pages 154-165, 1998.
[23] A. Sinha and C. M. Chase. Prefetching and caching for query scheduling in a special class of distributed applications. In ICPP, Vol. 3, pages 95-102.
[24] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In ACM SIGMOD, pages 321-330, 1992.
[25] J. S. Vitter and P. Krishnan. Optimal prefetching via data compression. Journal of the ACM, 43(5):771-793, 1996.



More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

A reversible data hiding based on adaptive prediction technique and histogram shifting

A reversible data hiding based on adaptive prediction technique and histogram shifting A reversible data hiding based on adaptive prediction technique and histogram shifting Rui Liu, Rongrong Ni, Yao Zhao Institute of Information Science Beijing Jiaotong University E-mail: rrni@bjtu.edu.cn

More information

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Rada Chirkova Department of Computer Science, North Carolina State University Raleigh, NC 27695-7535 chirkova@csc.ncsu.edu Foto Afrati

More information

Energy Conservation of Sensor Nodes using LMS based Prediction Model

Energy Conservation of Sensor Nodes using LMS based Prediction Model Energy Conservation of Sensor odes using LMS based Prediction Model Anagha Rajput 1, Vinoth Babu 2 1, 2 VIT University, Tamilnadu Abstract: Energy conservation is one of the most concentrated research

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

SeqIndex: Indexing Sequences by Sequential Pattern Analysis

SeqIndex: Indexing Sequences by Sequential Pattern Analysis SeqIndex: Indexing Sequences by Sequential Pattern Analysis Hong Cheng Xifeng Yan Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign {hcheng3, xyan, hanj}@cs.uiuc.edu

More information

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision report University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision Web Server master database User Interface Images + labels image feature algorithm Extract

More information

Module 6 NP-Complete Problems and Heuristics

Module 6 NP-Complete Problems and Heuristics Module 6 NP-Complete Problems and Heuristics Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu P, NP-Problems Class

More information

An index structure for efficient reverse nearest neighbor queries

An index structure for efficient reverse nearest neighbor queries An index structure for efficient reverse nearest neighbor queries Congjun Yang Division of Computer Science, Department of Mathematical Sciences The University of Memphis, Memphis, TN 38152, USA yangc@msci.memphis.edu

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.

More information

White paper ETERNUS Extreme Cache Performance and Use

White paper ETERNUS Extreme Cache Performance and Use White paper ETERNUS Extreme Cache Performance and Use The Extreme Cache feature provides the ETERNUS DX500 S3 and DX600 S3 Storage Arrays with an effective flash based performance accelerator for regions

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Chapter 3 - Memory Management

Chapter 3 - Memory Management Chapter 3 - Memory Management Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Memory Management 1 / 222 1 A Memory Abstraction: Address Spaces The Notion of an Address Space Swapping

More information

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi

More information

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods S.Anusuya 1, M.Balaganesh 2 P.G. Student, Department of Computer Science and Engineering, Sembodai Rukmani Varatharajan Engineering

More information

File Size Distribution on UNIX Systems Then and Now

File Size Distribution on UNIX Systems Then and Now File Size Distribution on UNIX Systems Then and Now Andrew S. Tanenbaum, Jorrit N. Herder*, Herbert Bos Dept. of Computer Science Vrije Universiteit Amsterdam, The Netherlands {ast@cs.vu.nl, jnherder@cs.vu.nl,

More information

Adaptive Middleware for Distributed Sensor Environments

Adaptive Middleware for Distributed Sensor Environments Adaptive Middleware for Distributed Sensor Environments Xingbo Yu, Koushik Niyogi, Sharad Mehrotra, Nalini Venkatasubramanian University of California, Irvine {xyu, kniyogi, sharad, nalini}@ics.uci.edu

More information

Record Placement Based on Data Skew Using Solid State Drives

Record Placement Based on Data Skew Using Solid State Drives Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories, NEC j-suzuki@ax.jp.nec.com

More information

Design and Implementation of A P2P Cooperative Proxy Cache System

Design and Implementation of A P2P Cooperative Proxy Cache System Design and Implementation of A PP Cooperative Proxy Cache System James Z. Wang Vipul Bhulawala Department of Computer Science Clemson University, Box 40974 Clemson, SC 94-0974, USA +1-84--778 {jzwang,

More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems Query Processing: A Systems View CPS 216 Advanced Database Systems Announcements (March 1) 2 Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

COOCHING: Cooperative Prefetching Strategy for P2P Video-on-Demand System

COOCHING: Cooperative Prefetching Strategy for P2P Video-on-Demand System COOCHING: Cooperative Prefetching Strategy for P2P Video-on-Demand System Ubaid Abbasi and Toufik Ahmed CNRS abri ab. University of Bordeaux 1 351 Cours de la ibération, Talence Cedex 33405 France {abbasi,

More information

Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data

Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data Evangelos Dellis, Bernhard Seeger, and Akrivi Vlachou Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Straße,

More information

Comp Online Algorithms

Comp Online Algorithms Comp 7720 - Online Algorithms Notes 4: Bin Packing Shahin Kamalli University of Manitoba - Fall 208 December, 208 Introduction Bin packing is one of the fundamental problems in theory of computer science.

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution Announcements (March 1) 2 Query Processing: A Systems View CPS 216 Advanced Database Systems Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System Xiaodong Shi Email: shixd.hust@gmail.com Dan Feng Email: dfeng@hust.edu.cn Wuhan National Laboratory for Optoelectronics,

More information

Network Load Balancing Methods: Experimental Comparisons and Improvement

Network Load Balancing Methods: Experimental Comparisons and Improvement Network Load Balancing Methods: Experimental Comparisons and Improvement Abstract Load balancing algorithms play critical roles in systems where the workload has to be distributed across multiple resources,

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

Effective Pattern Similarity Match for Multidimensional Sequence Data Sets

Effective Pattern Similarity Match for Multidimensional Sequence Data Sets Effective Pattern Similarity Match for Multidimensional Sequence Data Sets Seo-Lyong Lee, * and Deo-Hwan Kim 2, ** School of Industrial and Information Engineering, Hanu University of Foreign Studies,

More information

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure Lecture 6: External Interval Tree (Part II) Yufei Tao Division of Web Science and Technology Korea Advanced Institute of Science and Technology taoyf@cse.cuhk.edu.hk 3 Making the external interval tree

More information

Summary Cache based Co-operative Proxies

Summary Cache based Co-operative Proxies Summary Cache based Co-operative Proxies Project No: 1 Group No: 21 Vijay Gabale (07305004) Sagar Bijwe (07305023) 12 th November, 2007 1 Abstract Summary Cache based proxies cooperate behind a bottleneck

More information

An Automatic Hole Filling Method of Point Cloud for 3D Scanning

An Automatic Hole Filling Method of Point Cloud for 3D Scanning An Automatic Hole Filling Method of Point Cloud for 3D Scanning Yuta MURAKI Osaka Institute of Technology Osaka, Japan yuta.muraki@oit.ac.jp Koji NISHIO Osaka Institute of Technology Osaka, Japan koji.a.nishio@oit.ac.jp

More information

An Efficient Execution Scheme for Designated Event-based Stream Processing

An Efficient Execution Scheme for Designated Event-based Stream Processing DEIM Forum 2014 D3-2 An Efficient Execution Scheme for Designated Event-based Stream Processing Yan Wang and Hiroyuki Kitagawa Graduate School of Systems and Information Engineering, University of Tsukuba

More information

Page Replacement. (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018

Page Replacement. (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018 Page Replacement (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018 Today s Goals Making virtual memory virtual : incorporating disk backing. Explore page replacement policies

More information

A Framework for Clustering Massive Text and Categorical Data Streams

A Framework for Clustering Massive Text and Categorical Data Streams A Framework for Clustering Massive Text and Categorical Data Streams Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Philip S. Yu IBM T. J.Watson Research Center psyu@us.ibm.com Abstract

More information

L9: Storage Manager Physical Data Organization

L9: Storage Manager Physical Data Organization L9: Storage Manager Physical Data Organization Disks and files Record and file organization Indexing Tree-based index: B+-tree Hash-based index c.f. Fig 1.3 in [RG] and Fig 2.3 in [EN] Functional Components

More information

Maintenance of the Prelarge Trees for Record Deletion

Maintenance of the Prelarge Trees for Record Deletion 12th WSEAS Int. Conf. on APPLIED MATHEMATICS, Cairo, Egypt, December 29-31, 2007 105 Maintenance of the Prelarge Trees for Record Deletion Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu Department of

More information