Fractal Based Approach for Indexing and Querying Heterogeneous Data Streams

Similar documents
A Unified Framework for Monitoring Data Streams in Real Time

The Grid File: An Adaptable, Symmetric Multikey File Structure

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

Experimental Evaluation of Spatial Indices with FESTIval

Indexing by Shape of Image Databases Based on Extended Grid Files

Determining Differences between Two Sets of Polygons

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Similarity Search in Time Series Databases

Introduction to Spatial Database Systems

Data Organization and Processing

X-tree. Daniel Keim a, Benjamin Bustos b, Stefan Berchtold c, and Hans-Peter Kriegel d. SYNONYMS Extended node tree

COMP 465: Data Mining Still More on Clustering

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

Temporal Range Exploration of Large Scale Multidimensional Time Series Data

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Using Natural Clusters Information to Build Fuzzy Indexing Structure

Multidimensional Data and Modelling - DBMS

SPATIOTEMPORAL INDEXING MECHANISM BASED ON SNAPSHOT-INCREMENT

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Striped Grid Files: An Alternative for Highdimensional

Effective Pattern Similarity Match for Multidimensional Sequence Data Sets

Approximation Algorithms for Geometric Intersection Graphs

Spatiotemporal Access to Moving Objects. Hao LIU, Xu GENG 17/04/2018

CMSC 754 Computational Geometry 1

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Collision and Proximity Queries

Module 8: Video Coding Basics Lecture 42: Sub-band coding, Second generation coding, 3D coding. The Lecture Contains: Performance Measures

Extending Rectangle Join Algorithms for Rectilinear Polygons

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

Max-Count Aggregation Estimation for Moving Points

ADS: The Adaptive Data Series Index

Conflict Serializable Scheduling Protocol for Clipping Indexing in Multidimensional Database

An Introduction to Spatial Databases

Iteration Reduction K Means Clustering Algorithm

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

Distributed k-nn Query Processing for Location Services

Foundations of Computing

Joint Entity Resolution

Clustering and Visualisation of Data

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

Leveraging Set Relations in Exact Set Similarity Join

Keywords: clustering algorithms, unsupervised learning, cluster validity

26 The closest pair problem

TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets

A Review on Cluster Based Approach in Data Mining

Big Data Analytics CSCI 4030

Continuous Query Processing in Spatio-temporal Databases

Generation of 3D Fractal Images for Mandelbrot and Julia Sets

Event Detection using Archived Smart House Sensor Data obtained using Symbolic Aggregate Approximation

Challenges in Ubiquitous Data Mining

CHAPTER 3 DIFFERENT DOMAINS OF WATERMARKING. domain. In spatial domain the watermark bits directly added to the pixels of the cover

Constructing a G(N, p) Network

Collaborative Rough Clustering

Mining Frequent Itemsets from Data Streams with a Time- Sensitive Sliding Window

Lecturer 2: Spatial Concepts and Data Models

References. Additional lecture notes for 2/18/02.

A Framework for Clustering Massive Text and Categorical Data Streams

Implementation Techniques

Implementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture

Module 7 VIDEO CODING AND MOTION ESTIMATION

COMP3121/3821/9101/ s1 Assignment 1

Computational Geometry

Multiresolution Motif Discovery in Time Series

Density Based Clustering using Modified PSO based Neighbor Selection

Data Preprocessing. Slides by: Shree Jaswal

R-Trees. Accessing Spatial Data

Motion Planning. O Rourke, Chapter 8

9. Three Dimensional Object Representations

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters

Time Series Analysis DM 2 / A.A

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

G 6i try. On the Number of Minimal 1-Steiner Trees* Discrete Comput Geom 12:29-34 (1994)

Comparative Study of Subspace Clustering Algorithms

Motion Estimation for Video Coding Standards

Points covered an odd number of times by translates

M. Andrea Rodríguez-Tastets. I Semester 2008

Evaluating Continuous Nearest Neighbor Queries for Streaming Time Series via Pre-fetching

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. MySQL UC 2010 How Fractal Trees Work 1

TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

Multidimensional Data and Modelling

Clustering from Data Streams

A Parallel Access Method for Spatial Data Using GPU

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

Chap4: Spatial Storage and Indexing. 4.1 Storage:Disk and Files 4.2 Spatial Indexing 4.3 Trends 4.4 Summary

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

66 III Complexes. R p (r) }.

UNIT 2 Data Preprocessing

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1

[10] Industrial DataMatrix barcodes recognition with a random tilt and rotating the camera

Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach

Design and Analysis of an Euler Transformation Algorithm Applied to Full-Polarimetric ISAR Imagery

Iterated Functions Systems and Fractal Coding

Clustering Of Ecg Using D-Stream Algorithm

Wavelet Based Image Compression Using ROI SPIHT Coding

A Survey on Disk-based Genome. Sequence Indexing

Time series representations: a state-of-the-art

Analyzing Data Streams by Online DFT

Transcription:

Fractal Based Approach for Indexing and Querying Heterogeneous Data Streams G. Netaji & MHM. Krishna Prasad Dept. of IT, JNTUK UCEV, Vizianagaram, A.P., 535 003, India E-mail: netaji.gandi@gmail.com, krishnaprasad.mhm@gmail.com Abstract Efficient handling of data stream is become mandatory in various domains such as business, finance, sensor networks etc. Managing these data streams poses challenges that are distinct from those addressed by traditional database management systems. The enormous increase of cellular phones, GPS-like devices, and RFIDs results in highly dynamic environments where objects as well as queries are continuously moving. This paper presents a continuous query processor designed specifically for highly dynamic environments. This paper presents a data stream indexing system that satisfies the requirements in evolving data streams. Finally, applying the fractal based indexing scheme that enable the system for fast indexing and querying. This paper also address the issue of memory based and disk based fractal management to support skew distributions of data stream. Keywords Data stream, Indexing, Querying, Self similarity, Fractal Dimension, Fractal Tree. The existing query processing techniques focus on solving special cases of continuous spatio-temporal queries are valid only for moving queries on stationary objects, are valid only for stationary range queries. Thus, a system that manages infinite heterogeneous data streams must satisfy the following requirements 1. Mechanism for rate control must be in place to deal with data streams that are nearly always too fast for any indexing system. 2. There must be capacity control because the data streams are infinite while the storage space is finite. I. INTRODUCTION Most of spatio-temporal queries are continuous in nature. Unlike snapshot queries that are evaluated only once, continuous queries require continuous evaluation as the query result becomes invalid with the change of information. The Existing algorithms for continuous spatiotemporal queries aim to optimize the time interval between each two instances of the snapshot queries with the large number of continuous queries, reevaluating a continuous spatio-temporal query, even with large time intervals, poses a redundant processing for the locationaware servers. Those focus on evaluating only one spatio-temporal query In a typical location-aware server, there is a huge number of concurrently outstanding continuous spatiotemporal queries. Handling each query as an individual entity dramatically degrades the performance of the location-aware server. Fig1: B-Tree Indexing 3. The structure of the data varies in time (i.e., heterogeneity of data streams), therefore adaptive indexing methods must be developed. 4. Data should be efficiently stored. 5. Volume can be huge, therefore everything must be append-only; both indexing and query processing must only use forward-only file access. 6. Users can query and filter the index streams efficiently. 84

II. RELATED WORK Shifted-Wavelet Tree (SWT) is a wavelet-based summary structure proposed for detecting bursts of events on a stream of data [1]. For a given set of query windows SWT maintains moving aggregates using a wavelet tree for incremental computation. A window is monitored by the lowest level that satisfies windows condition. Therefore, associated with each level, there is a threshold equal to the smallest of the thresholds of windows monitored by that level. Whenever the moving sum at some level exceeds the level threshold all query windows associated with this level are checked using a brute force approach. MR-Index addresses variable length queries over time series data [2]. Wavelets are used to extract features from a time series at multiple resolutions. At each resolution, a set of feature vectors are combined into an MBR and stored sequentially in the order they are computed. A given query is decomposed into multiple sub-queries such that each sub-query has resolution corresponding to a resolution at the index. A given set of candidate MBRs are refined using each query as a filter to prune out non-potential candidates. Versions of piecewise constant approximation are proposed for time series similarity matching. Specifically, adaptive piecewise constant approximation (APCA) represents data regions of great fluctuations with several short segments, while data regions of less fluctuations are represented with fewer, long segments [3]. An extension of this approximation allows error specification for each point in time [4]. The resulting approach can approximate data with fidelity proportional to its age. GeneralMatch, a refreshingly new idea in similarity matching, divides the data sequences into disjoint windows, and the query sequence into sliding windows [5]. This approach is the dual of the conventional approaches, i.e., dividing the data sequence into sliding windows, and the query sequence into disjoint windows. The overall framework is based on answering pattern queries using a singleresolution index built on a specific choice of window size. The allowed window size depends on the minimum query length, which has to be provided a-priori before the index construction. Methods based on multi-variate linear regression are considered for analyzing co-evolving time sequences [6]. For a given stream S, its current value (dependent variable) is expressed as a linear combination of values of the same and other streams (independent variables) under sliding window model. Given v independent variables and a dependent variable y with N samples each, the model identifies the best b independent variables in order to compute the current value of the dependent variable in O(Nbv 2 ) time. Stat stream is a state of the art system proposed for monitoring a large number of streams in real time [7]. It subdivides the history of a stream into a fixed number of basic windows and maintains DFT coefficients for each basic window. This allows a batch update of DFT coefficients over the entire history. It superimposes an orthogonal regular grid on the feature space, and partitions the space into cells of diameter r, the correlation threshold. Each stream is mapped to a number of cells (exactly how many depends on the "lag time") in the feature space based on a subset of its DFT coefficients. It uses proximity in this feature space to report correlations [8]. III. FRACTALS Fractals are of rough or fragmented geometric shape that can be subdivided in parts, each of which is (at least approximately) a reduced copy of the whole. They are crinkly objects that defy conventional measures, such as length and are most often characterized by their fractal dimension They are mathematical sets with a high degree of geometrical complexity that can model many natural phenomena. Almost all natural objects can be observed as fractals (coastlines, trees, mountains, and clouds). Their fractal dimension strictly exceeds topological dimension Fractal dimension The number, very often non-integer, often the only one measure of fractals. It measures the degree of fractal boundary fragmentation or irregularity over multiple scales. It determines how fractal differs from Euclidean objects (point, line, plane, circle etc) Self-similarity/ Semi-self similarity Fractal is strictly self-similar if it can be expressed as a union of sets, each of which is an exactly reduced copy (is geometrically similar to) of the full set. The most fractal looking in nature do not display this precise form Natural objects are not union of exact reduced copies of whole. A magnified view of one part will not precisely reproduce the whole object, but it will have the same qualitative appearance. This property is called statistical self-similarity or semi-self-similarity How Fractal Tree Databases Work Insertion bottlenecks lie at the heart of database and file-system innovations, best practices, and system workarounds. Most databases and file systems are based on B-tree data structures, and suffer from the performance cliffs and unpredictable run times of B- trees. In this talk, we introduce the Fractal Tree data structure and explain how it works and how it provides 85

dramatically improved performance in both theory and in practice. From a theoretical perspective, if B is the blocktransfer size, the B-tree performs O(log N/log B) block transfers per insert in the worst case. In contrast, the Fractal Tree structure performs O((log N)/B) memory transfers per insert, which translates to run-time improvements of two orders of magnitude. To relate that theory to practice, we present an algorithmic model for B-tree performance bottlenecks. We explain how the bottlenecks affect best practice and how database designers typically modify B-trees to try to mitigate the bottlenecks. Then we show how Fractal Tree structures can attain faster insertion rates, intuitively by transforming disk-seek bottlenecks into disk-bandwidth bottlenecks. Fractal Tree data structure Advantages A drop-in B-tree replacement supporting fast insertions for high entropy data. 100x faster inserts good data locality with no memory-specific parameterization Data can be indexed quickly on disk, even if the data arrival order is independent of data-query order. Thus, insertion performance is a currency that we can use to achieve faster query performance through indexing Fractal Trees are generally faster, easier to implement, and platform independent Fractal Trees scale with disk bandwidth not seek time Properties of a Simplified Fractal Tree log N arrays, one array for each power of two Each array is completely full or empty Each array is sorted Searching in a Simplified Fractal Tree Fractal Tree indexes can use 1/100th the power of B-trees Fractal Tree Indexes There are same operations in B-trees and fractal trees. The main difference is at performance. The Random input output is replaced with sequential Input /output so faster. So, Fractal Trees index data at closer disk bandwidth rates, not on the structure of the primary and secondary keys, and have some range of queries that stream data and close disk bandwidth rates, even as the database increases. Hence, more indexes shall be maintained without a loss of performance. Adding data to the indexes makes the B-trees stress on performance of B-trees, but performs well in Fractal Tree indexes. Scope of Fractal Tree Indexes Fractal Tree indexes will make good use of cores. Fractal Tree indexes ride the right technology trends In the future, all storage systems will use Fractal Tree indexes Index maintenance As new values stream in, new features are computed and inserted into the corresponding index structures while features that are out of history of interest are deleted to save space. Coefficients are computed at multiple resolutions starting from level 0 up to a configurable level J: at each level a sliding window is used to extract the appropriate features. Computation of features at higher levels is accelerated using the MBRs at lower levels. The MBRs belonging to a specific stream are threaded together in order to provide a sequential access to the summary information about the stream. This approach results in a constant retrieval time of the MBRs. The complete algorithm is shown in Algorithm 1. Require: B MBR at level j for stream S. w := W (the window size at the lowest resolution); tn denotes the current discrete time; for j:=0 to J do B := the current MBR at level j for stream S; If j = 0 then y := S[tn - w + 1 : tn]; normalize y if F = DWT, F := F(y); else find MBR that contains the feature for the subsequence S[tn - w + 1 : tn - y]; find MBR B, S, tn at contains the feature for the subsequence S[tn - y + 1 : tn]; if number of features in B < c (box capacity) then insert F in to B ; else insert BSi into index at level j; start a new MBR B insert Fj into B adjust the sliding window size to w := w * 2; end for Algorithm 1: Algorithm for Indexing 86

Features at a given level are maintained in a high dimensional index structure. The index combines information from all the streams, and provides a scalable the access medium for answering queries over multiple data streams. However, each MBR inserted into the index is specific to a single stream. The R*-Tree family of index structures are used for indexing MBRs at each level [9]. An R*-Tree, a variant of R-Tree [10], is a spatial access method, which splits feature space in hierarchically nested, possibly overlapping MBRs. In order to support frequent updates, the techniques for predicting MBR boundaries outlined in can be used to decrease the cost of index maintenance. Querying Streams Assume that the query window size is a multiple of W. An aggregate query with window size w and threshold T is answered by first partitioning the window into multiple sub-windows. For a given window of length bw, the partitioning corresponds to the ones in the binary representation of b. The current aggregate over a window of size w is computed using the subaggregates for sub-windows in the partitioning. Assume that W = 2 and c = 2. Consider a query window w = 26. The binary representation of b = 13 is 1101, and therefore the query is partitioned into three sub-windows wo = 2, w2 = 8, and w3 = 16. The current aggregate over a window of size w = 26 is approximated using the extents of MBRs that contain the corresponding subaggregates. The computation is approximate in the sense that the algorithm returns an interval F such that the upper coordinate F[2]is always greater than or equal to the true aggregate. If F[2]is larger than the threshold T, the most recent subsequence of length w is retrieved, and the true aggregate is computed. If this value exceeds the query threshold, an alarm is raised. The complete algorithm is shown in Algorithm 2. Given Stream S, Window w, Threshold T initialize t to tn, the current discrete time; partition w into n parts as wl, w2,..., wn; initialize aggregate T; for i := 1 to i := n do find the resolution level j such that Wi = W2 j ; MBR B contains the feature on S[t - wi + 1 : t]; merge sub-aggregate B to T := F(B, F); adjust offset to t := t - Wi for next sub-window; end for if T < F[2] then retrieve S[tn - w + 1 : t,n]; raise an alarm; Algorithm 2: Algorithm for Querying IV. EXPERIMENTAL WORK We used both real life and synthetic data streams for conducting experiments. Our intention is to reduce the time taken for both indexing and querying on data streams, to find the effect of the frequency and magnitude of the records on Accuracy and to analyze the performance of our approach. The tests are done on a unix machine with a 2400 MHz CPU and 2GB main memory. The both Real life data and synthetic data have been used in our experiments. Adult Data. We use real life Adult data.we sampled the data from the records with in one period with five million transactional records. The attributes include age, Age, Fnl-wgt, Education-num, Capital_Gain, Capital Loss, Workclass, Education, Martial_status, Occupation, Relationship, Race, Sex, Native_Country etc. Synthetic Data. We create synthetic data that have been used to simulate time-changing concepts to analyise the performance of indexing and Querying on data streams. Time analysis, we analyze the time taken for insertion into our Fractal tree and B-tree. And we found that our tree is performing good work than the B-tree. The below figure(i.e.fig2). Fig. 2: Random Inserts into Fractal Tree and B-tree Querying Performance, We analyzed the performance of our Tree with B.tree in Querying the data. The Performance of Fractal tree with B-Tree is shown in the below figure(i.e. Fig 3). 87

Fig. 3: Searches in Fractal Tree B- Tree V. CONCLUSION AND FUTURE SCOPE We conclude with performance results showing how a Fractal Tree storage engine can maintain rich indexes more efficiently than B-trees. Surprisingly, Fractal Tree structures seem to maintain their order-ofmagnitude competitive advantage over B-trees on SSDs as well as traditional rotating media. Here we used centralized monitoring so, there is scope in distributing the monitoring system and content distribution over networks. VI. REFERENCES [1] Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In SIGKDD, pages 336-345,2003. [2] T. Kahveci and A. Singh. Variable length queries for time series data. In ICDE, pages 273-282,2001. [3] E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, pages 15 1-162,2001. [4] T. Palpanas, M. Vlachos, E. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In ICDE, pages 338-349,2004. [5] Y. Moon, K. Whang, and W. Han. General match: a subsequence matching method in timeseries databases based on generalized windows. In SIGMOD, pages 382-393,2002. [6] B. Yi, N. Sidiropoulos, T. Johnson, H. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. In ICDE,2000. [7] Y Zhu and D. Shasha. Statstream: Statistical monitoring of thousands of data streams in real time. In VZDB, pages 358-369,2002. [8] J. Bentley, B. Weide, and A. Yao. Optimal expected time algorithms for closest point problems. In ACM Trans. on Math. Software, volume 6, pages 563-580,1980. [9] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pages 322-33 1,1990. [10] Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47-57,1984. 88