Fractal Based Approach for Indexing and Querying Heterogeneous Data Streams

Fractal Based Approach for Indexing and Querying Heterogeneous Data Streams G. Netaji & MHM. Krishna Prasad Dept. of IT, JNTUK UCEV, Vizianagaram, A.P., 535 003, India E-mail: netaji.gandi@gmail.com, krishnaprasad.mhm@gmail.com Abstract Efficient handling of data stream is become mandatory in various domains such as business, finance, sensor networks etc. Managing these data streams poses challenges that are distinct from those addressed by traditional database management systems. The enormous increase of cellular phones, GPS-like devices, and RFIDs results in highly dynamic environments where objects as well as queries are continuously moving. This paper presents a continuous query processor designed specifically for highly dynamic environments. This paper presents a data stream indexing system that satisfies the requirements in evolving data streams. Finally, applying the fractal based indexing scheme that enable the system for fast indexing and querying. This paper also address the issue of memory based and disk based fractal management to support skew distributions of data stream. Keywords Data stream, Indexing, Querying, Self similarity, Fractal Dimension, Fractal Tree. The existing query processing techniques focus on solving special cases of continuous spatio-temporal queries are valid only for moving queries on stationary objects, are valid only for stationary range queries. Thus, a system that manages infinite heterogeneous data streams must satisfy the following requirements 1. Mechanism for rate control must be in place to deal with data streams that are nearly always too fast for any indexing system. 2. There must be capacity control because the data streams are infinite while the storage space is finite. I. INTRODUCTION Most of spatio-temporal queries are continuous in nature. Unlike snapshot queries that are evaluated only once, continuous queries require continuous evaluation as the query result becomes invalid with the change of information. The Existing algorithms for continuous spatiotemporal queries aim to optimize the time interval between each two instances of the snapshot queries with the large number of continuous queries, reevaluating a continuous spatio-temporal query, even with large time intervals, poses a redundant processing for the locationaware servers. Those focus on evaluating only one spatio-temporal query In a typical location-aware server, there is a huge number of concurrently outstanding continuous spatiotemporal queries. Handling each query as an individual entity dramatically degrades the performance of the location-aware server. Fig1: B-Tree Indexing 3. The structure of the data varies in time (i.e., heterogeneity of data streams), therefore adaptive indexing methods must be developed. 4. Data should be efficiently stored. 5. Volume can be huge, therefore everything must be append-only; both indexing and query processing must only use forward-only file access. 6. Users can query and filter the index streams efficiently. 84

II. RELATED WORK Shifted-Wavelet Tree (SWT) is a wavelet-based summary structure proposed for detecting bursts of events on a stream of data [1]. For a given set of query windows SWT maintains moving aggregates using a wavelet tree for incremental computation. A window is monitored by the lowest level that satisfies windows condition. Therefore, associated with each level, there is a threshold equal to the smallest of the thresholds of windows monitored by that level. Whenever the moving sum at some level exceeds the level threshold all query windows associated with this level are checked using a brute force approach. MR-Index addresses variable length queries over time series data [2]. Wavelets are used to extract features from a time series at multiple resolutions. At each resolution, a set of feature vectors are combined into an MBR and stored sequentially in the order they are computed. A given query is decomposed into multiple sub-queries such that each sub-query has resolution corresponding to a resolution at the index. A given set of candidate MBRs are refined using each query as a filter to prune out non-potential candidates. Versions of piecewise constant approximation are proposed for time series similarity matching. Specifically, adaptive piecewise constant approximation (APCA) represents data regions of great fluctuations with several short segments, while data regions of less fluctuations are represented with fewer, long segments [3]. An extension of this approximation allows error specification for each point in time [4]. The resulting approach can approximate data with fidelity proportional to its age. GeneralMatch, a refreshingly new idea in similarity matching, divides the data sequences into disjoint windows, and the query sequence into sliding windows [5]. This approach is the dual of the conventional approaches, i.e., dividing the data sequence into sliding windows, and the query sequence into disjoint windows. The overall framework is based on answering pattern queries using a singleresolution index built on a specific choice of window size. The allowed window size depends on the minimum query length, which has to be provided a-priori before the index construction. Methods based on multi-variate linear regression are considered for analyzing co-evolving time sequences [6]. For a given stream S, its current value (dependent variable) is expressed as a linear combination of values of the same and other streams (independent variables) under sliding window model. Given v independent variables and a dependent variable y with N samples each, the model identifies the best b independent variables in order to compute the current value of the dependent variable in O(Nbv 2 ) time. Stat stream is a state of the art system proposed for monitoring a large number of streams in real time [7]. It subdivides the history of a stream into a fixed number of basic windows and maintains DFT coefficients for each basic window. This allows a batch update of DFT coefficients over the entire history. It superimposes an orthogonal regular grid on the feature space, and partitions the space into cells of diameter r, the correlation threshold. Each stream is mapped to a number of cells (exactly how many depends on the "lag time") in the feature space based on a subset of its DFT coefficients. It uses proximity in this feature space to report correlations [8]. III. FRACTALS Fractals are of rough or fragmented geometric shape that can be subdivided in parts, each of which is (at least approximately) a reduced copy of the whole. They are crinkly objects that defy conventional measures, such as length and are most often characterized by their fractal dimension They are mathematical sets with a high degree of geometrical complexity that can model many natural phenomena. Almost all natural objects can be observed as fractals (coastlines, trees, mountains, and clouds). Their fractal dimension strictly exceeds topological dimension Fractal dimension The number, very often non-integer, often the only one measure of fractals. It measures the degree of fractal boundary fragmentation or irregularity over multiple scales. It determines how fractal differs from Euclidean objects (point, line, plane, circle etc) Self-similarity/ Semi-self similarity Fractal is strictly self-similar if it can be expressed as a union of sets, each of which is an exactly reduced copy (is geometrically similar to) of the full set. The most fractal looking in nature do not display this precise form Natural objects are not union of exact reduced copies of whole. A magnified view of one part will not precisely reproduce the whole object, but it will have the same qualitative appearance. This property is called statistical self-similarity or semi-self-similarity How Fractal Tree Databases Work Insertion bottlenecks lie at the heart of database and file-system innovations, best practices, and system workarounds. Most databases and file systems are based on B-tree data structures, and suffer from the performance cliffs and unpredictable run times of B- trees. In this talk, we introduce the Fractal Tree data structure and explain how it works and how it provides 85

dramatically improved performance in both theory and in practice. From a theoretical perspective, if B is the blocktransfer size, the B-tree performs O(log N/log B) block transfers per insert in the worst case. In contrast, the Fractal Tree structure performs O((log N)/B) memory transfers per insert, which translates to run-time improvements of two orders of magnitude. To relate that theory to practice, we present an algorithmic model for B-tree performance bottlenecks. We explain how the bottlenecks affect best practice and how database designers typically modify B-trees to try to mitigate the bottlenecks. Then we show how Fractal Tree structures can attain faster insertion rates, intuitively by transforming disk-seek bottlenecks into disk-bandwidth bottlenecks. Fractal Tree data structure Advantages A drop-in B-tree replacement supporting fast insertions for high entropy data. 100x faster inserts good data locality with no memory-specific parameterization Data can be indexed quickly on disk, even if the data arrival order is independent of data-query order. Thus, insertion performance is a currency that we can use to achieve faster query performance through indexing Fractal Trees are generally faster, easier to implement, and platform independent Fractal Trees scale with disk bandwidth not seek time Properties of a Simplified Fractal Tree log N arrays, one array for each power of two Each array is completely full or empty Each array is sorted Searching in a Simplified Fractal Tree Fractal Tree indexes can use 1/100th the power of B-trees Fractal Tree Indexes There are same operations in B-trees and fractal trees. The main difference is at performance. The Random input output is replaced with sequential Input /output so faster. So, Fractal Trees index data at closer disk bandwidth rates, not on the structure of the primary and secondary keys, and have some range of queries that stream data and close disk bandwidth rates, even as the database increases. Hence, more indexes shall be maintained without a loss of performance. Adding data to the indexes makes the B-trees stress on performance of B-trees, but performs well in Fractal Tree indexes. Scope of Fractal Tree Indexes Fractal Tree indexes will make good use of cores. Fractal Tree indexes ride the right technology trends In the future, all storage systems will use Fractal Tree indexes Index maintenance As new values stream in, new features are computed and inserted into the corresponding index structures while features that are out of history of interest are deleted to save space. Coefficients are computed at multiple resolutions starting from level 0 up to a configurable level J: at each level a sliding window is used to extract the appropriate features. Computation of features at higher levels is accelerated using the MBRs at lower levels. The MBRs belonging to a specific stream are threaded together in order to provide a sequential access to the summary information about the stream. This approach results in a constant retrieval time of the MBRs. The complete algorithm is shown in Algorithm 1. Require: B MBR at level j for stream S. w := W (the window size at the lowest resolution); tn denotes the current discrete time; for j:=0 to J do B := the current MBR at level j for stream S; If j = 0 then y := S[tn - w + 1 : tn]; normalize y if F = DWT, F := F(y); else find MBR that contains the feature for the subsequence S[tn - w + 1 : tn - y]; find MBR B, S, tn at contains the feature for the subsequence S[tn - y + 1 : tn]; if number of features in B < c (box capacity) then insert F in to B ; else insert BSi into index at level j; start a new MBR B insert Fj into B adjust the sliding window size to w := w * 2; end for Algorithm 1: Algorithm for Indexing 86

Features at a given level are maintained in a high dimensional index structure. The index combines information from all the streams, and provides a scalable the access medium for answering queries over multiple data streams. However, each MBR inserted into the index is specific to a single stream. The R*-Tree family of index structures are used for indexing MBRs at each level [9]. An R*-Tree, a variant of R-Tree [10], is a spatial access method, which splits feature space in hierarchically nested, possibly overlapping MBRs. In order to support frequent updates, the techniques for predicting MBR boundaries outlined in can be used to decrease the cost of index maintenance. Querying Streams Assume that the query window size is a multiple of W. An aggregate query with window size w and threshold T is answered by first partitioning the window into multiple sub-windows. For a given window of length bw, the partitioning corresponds to the ones in the binary representation of b. The current aggregate over a window of size w is computed using the subaggregates for sub-windows in the partitioning. Assume that W = 2 and c = 2. Consider a query window w = 26. The binary representation of b = 13 is 1101, and therefore the query is partitioned into three sub-windows wo = 2, w2 = 8, and w3 = 16. The current aggregate over a window of size w = 26 is approximated using the extents of MBRs that contain the corresponding subaggregates. The computation is approximate in the sense that the algorithm returns an interval F such that the upper coordinate F[2]is always greater than or equal to the true aggregate. If F[2]is larger than the threshold T, the most recent subsequence of length w is retrieved, and the true aggregate is computed. If this value exceeds the query threshold, an alarm is raised. The complete algorithm is shown in Algorithm 2. Given Stream S, Window w, Threshold T initialize t to tn, the current discrete time; partition w into n parts as wl, w2,..., wn; initialize aggregate T; for i := 1 to i := n do find the resolution level j such that Wi = W2 j ; MBR B contains the feature on S[t - wi + 1 : t]; merge sub-aggregate B to T := F(B, F); adjust offset to t := t - Wi for next sub-window; end for if T < F[2] then retrieve S[tn - w + 1 : t,n]; raise an alarm; Algorithm 2: Algorithm for Querying IV. EXPERIMENTAL WORK We used both real life and synthetic data streams for conducting experiments. Our intention is to reduce the time taken for both indexing and querying on data streams, to find the effect of the frequency and magnitude of the records on Accuracy and to analyze the performance of our approach. The tests are done on a unix machine with a 2400 MHz CPU and 2GB main memory. The both Real life data and synthetic data have been used in our experiments. Adult Data. We use real life Adult data.we sampled the data from the records with in one period with five million transactional records. The attributes include age, Age, Fnl-wgt, Education-num, Capital_Gain, Capital Loss, Workclass, Education, Martial_status, Occupation, Relationship, Race, Sex, Native_Country etc. Synthetic Data. We create synthetic data that have been used to simulate time-changing concepts to analyise the performance of indexing and Querying on data streams. Time analysis, we analyze the time taken for insertion into our Fractal tree and B-tree. And we found that our tree is performing good work than the B-tree. The below figure(i.e.fig2). Fig. 2: Random Inserts into Fractal Tree and B-tree Querying Performance, We analyzed the performance of our Tree with B.tree in Querying the data. The Performance of Fractal tree with B-Tree is shown in the below figure(i.e. Fig 3). 87

Fig. 3: Searches in Fractal Tree B- Tree V. CONCLUSION AND FUTURE SCOPE We conclude with performance results showing how a Fractal Tree storage engine can maintain rich indexes more efficiently than B-trees. Surprisingly, Fractal Tree structures seem to maintain their order-ofmagnitude competitive advantage over B-trees on SSDs as well as traditional rotating media. Here we used centralized monitoring so, there is scope in distributing the monitoring system and content distribution over networks. VI. REFERENCES [1] Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In SIGKDD, pages 336-345,2003. [2] T. Kahveci and A. Singh. Variable length queries for time series data. In ICDE, pages 273-282,2001. [3] E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, pages 15 1-162,2001. [4] T. Palpanas, M. Vlachos, E. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In ICDE, pages 338-349,2004. [5] Y. Moon, K. Whang, and W. Han. General match: a subsequence matching method in timeseries databases based on generalized windows. In SIGMOD, pages 382-393,2002. [6] B. Yi, N. Sidiropoulos, T. Johnson, H. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. In ICDE,2000. [7] Y Zhu and D. Shasha. Statstream: Statistical monitoring of thousands of data streams in real time. In VZDB, pages 358-369,2002. [8] J. Bentley, B. Weide, and A. Yao. Optimal expected time algorithms for closest point problems. In ACM Trans. on Math. Software, volume 6, pages 563-580,1980. [9] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pages 322-33 1,1990. [10] Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47-57,1984. 88