TiP: Analyzing Periodic Time Series Patterns

Similar documents
ProUD: Probabilistic Ranking in Uncertain Databases

Interval-focused Similarity Search in Time Series Databases

Generalizing the Optimality of Multi-Step k-nearest Neighbor Query Processing

Similarity Search in Multimedia Time Series Data using Amplitude-Level Features

Similarity Search on Time Series based on Threshold Queries

Reverse k-nearest Neighbor Search in Dynamic and General Metric Databases

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Database support for concurrent digital mock up

Visualizing and Animating Search Operations on Quadtrees on the Worldwide Web

X-tree. Daniel Keim a, Benjamin Bustos b, Stefan Berchtold c, and Hans-Peter Kriegel d. SYNONYMS Extended node tree

Chapter 5: Outlier Detection

SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

CS490D: Introduction to Data Mining Prof. Chris Clifton

Spatial Outlier Detection

Pattern Mining in Frequent Dynamic Subgraphs

Introduction to Trajectory Clustering. By YONGLI ZHANG

Spatiotemporal Access to Moving Objects. Hao LIU, Xu GENG 17/04/2018

Data Representation in Visualisation

Subspace Similarity Search: Efficient k-nn Queries in Arbitrary Subspaces

Knowledge Discovery in Databases II Summer Term 2017

Generating Traffic Data

Hierarchical Density-Based Clustering for Multi-Represented Objects

City, University of London Institutional Repository

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM

Managing Uncertain Spatio-Temporal Data

Detection of Outliers

In Proc. 4th Asia Pacific Bioinformatics Conference (APBC'06), Taipei, Taiwan, 2006.

Similarity Joins in MapReduce

Unsupervised Distributed Clustering

Chapter 1: Introduction to Spatial Databases

Knowledge-based pattern recognition and visualization of error logs of time-based engine sensor data: Requirements engineering and tool-support

Shape fitting and non convex data analysis

Research on time optimal trajectory planning of 7-DOF manipulator based on genetic algorithm

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Comparison of spatial indexes

A Framework for A Graph- and Queuing System-Based Pedestrian Simulation

An Efficient Bayesian Nearest Neighbor Search Using Marginal Object Weight Ranking Scheme in Spatial Databases

Introduction to Geospatial Analysis

Inverse Queries For Multidimensional Spaces

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

Multidimensional Data and Modelling - Operations

Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs DOE Visiting Faculty Program Project Report

Inverse Queries For Multidimensional Spaces

Accelerometer Gesture Recognition

Approximate Clustering of Time Series Using Compact Model-based Descriptions

Context based optimal shape coding

9/23/2009 CONFERENCES CONTINUOUS NEAREST NEIGHBOR SEARCH INTRODUCTION OVERVIEW PRELIMINARY -- POINT NN QUERIES

Temporal Range Exploration of Large Scale Multidimensional Time Series Data

Faster Clustering with DBSCAN

Quality Metrics for Visual Analytics of High-Dimensional Data

Towards New Heterogeneous Data Stream Clustering based on Density

Robust PDF Table Locator

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Analysis and Extensions of Popular Clustering Algorithms

Color Mining of Images Based on Clustering

Discovery of Agricultural Patterns Using Parallel Hybrid Clustering Paradigm

Context Aware Routing in Sensor Networks

Continuous Nearest Neighbor Search

A Robust Wipe Detection Algorithm

Sequences Modeling and Analysis Based on Complex Network

A New Approach for Shape Dissimilarity Retrieval Based on Curve Evolution and Ant Colony Optimization

Fast Discovery of Sequential Patterns Using Materialized Data Mining Views

ADCN: An Anisotropic Density-Based Clustering Algorithm for Discovering Spatial Point Patterns with Noise

Structure-Based Similarity Search with Graph Histograms

Materialized Data Mining Views *

Geometric and Thematic Integration of Spatial Data into Maps

A Novel Method to Estimate the Route and Travel Time with the Help of Location Based Services

Towards Rule Learning Approaches to Instance-based Ontology Matching

A METHOD FOR CONTENT-BASED SEARCHING OF 3D MODEL DATABASES

Effective Pattern Similarity Match for Multidimensional Sequence Data Sets

Time Series Analysis DM 2 / A.A

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Scaling the Topology of Symmetric, Second-Order Planar Tensor Fields

Downloaded from

Indexing the Positions of Continuously Moving Objects

Storm Identification in the Rainfall Data Using Singular Value Decomposition and K- Nearest Neighbour Classification

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti

The 3 Voxet axes can be annotated: Menu > Voxet > Tools > Annotate Voxet > Custom system.

Supplementary Figure 1. Decoding results broken down for different ROIs

AN IMPROVED DENSITY BASED k-means ALGORITHM

SIR C R REDDY COLLEGE OF ENGINEERING

Time-of-Flight Surface De-noising through Spectral Decomposition

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Line Clipping in E 3 with Expected Complexity O(1)

Efficient Spatial Query Processing in Geographic Database Systems

Similarity Search in Time Series Databases

Development of Efficient & Optimized Algorithm for Knowledge Discovery in Spatial Database Systems

Searching for Similar Trajectories in Spatial Networks

DeLiClu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity

Using R-trees for Interactive Visualization of Large Multidimensional Datasets

Web Data mining-a Research area in Web usage mining

Evolving Region Moving Object Random Dataset Generator

The Effects of Dimensionality Curse in High Dimensional knn Search

Voronoi-based Nearest Neighbor Search for Multi-Dimensional Uncertain Databases

Adaptive k-nearest-neighbor Classification Using a Dynamic Number of Nearest Neighbors

Knot Insertion and Reparametrization of Interval B-spline Curves

Francisco Arumi-Noe, Ph.D. School of Architecture University of Texas, Austin Texas A B S T R A C T

OPTICS-OF: Identifying Local Outliers

Finding Similarity in Time Series Data by Method of Time Weighted Moments

Transcription:

ip: Analyzing Periodic ime eries Patterns homas Bernecker, Hans-Peter Kriegel, Peer Kröger, and Matthias Renz Institute for Informatics, Ludwig-Maximilians-Universität München Oettingenstr. 67, 80538 München, Germany {bernecker,kriegel,kroeger,renz}@dbs.ifi.lmu.de http://www.dbs.ifi.lmu.de Abstract. ime series of sensor databases and scientific time series often consist of periodic patterns. Examples can be found in environmental analysis, where repeated measurements of climatic attributes like temperature, humidity or barometric pressure are taken. Depending on season-specific meteorological influences, curves of consecutive days can be strongly related to each other, whereas days of different seasons show different characteristics. Analyzing such phenomena could be very valuable for many application domains. Convenient similarity models that support similarity queries and mining based on periodic patterns are realized in the framework ip, which provides methods for the comparison of similarity query results based on different threshold-based feature extraction methods. In this demonstration, we present the visual and analytical methods of ip of detecting and evaluating periodic patterns in time series using the example of environmental data. 1 Introduction In a large range of application domains, e.g. environmental analysis, evolution of stock charts, research on medical behavior of organisms, or analysis and detection of motion activities, we are faced with time series data featuring activities which are composed of regularly repeating sequences of activity events. For that purpose, existing periodic patterns that repeatedly occur in specified periods over time have to be considered. hough consecutive motion patterns show similar characteristics, they are not equal. We can observe changes of significant importance in the shape of consecutive periodic patterns. ip utilizes the dual-domain representation of time series and the thresholdbased approach [1] to extract periodic patterns from them. Beyond the interest of [2], the temporal location and the evolution of consecutive patterns are focused. For efficient similarity computation, relevant feature information is extracted so that data mining techniques can apply. By visualization, ip provides first information about the existence and the location of periodic patterns in the time domains in the 3D space. Further knowledge about large datasets and the choice of adequate parameter settings for similarity queries and mining can be obtained by diverse analysis methods.

ime eries X (August 2003) ime eries Y (January 2001) τ τ Intersection et P τ (X) Intersection et P τ (Y) Features {x 1,, x m } d τ (X,Y) Features {y 1,, y n } Fig. 1. Feature extraction and similarity computation on dual-domain time series. Overall, ip serves as a framework to effectively and efficiently manage dualdomain time series. Furthermore, it provides several methods to process similarity queries on this type of time series based on user-defined features as well as an intuitive graphical user interface. heoretical background of the applied techniques in ip will be given in ection 2. Details of the system and the architecture will be explained in ection 3. ection 4 describes the planned demo tour in more detail. 2 heoretical Background ime eries Representation A time series X can be split into a sequence of subsequences of fixed length by adding an additional time domain [1]. his yields a dual-domain representation having a 3D surface. A dual-domain time series represents the temporal behavior along two time axes. Consider an example of environmental research. he trend of temperature within one month of a year having one value for each hour of a day is then represented by the evolution of the temperature within one day (first time domain), and, additionally, over consecutive days of the entire time period (second time domain). Formally, a dual-domain time series is defined as X dual = x 1,1,..., x 1,N 1, x 1,N,..., x M,1,..., x M,N 1, x M,N where x i,j denotes the value of the time series at time slot i in the first (discrete) time domain = {t 1,..., t N } and at time slot j in the second (discrete) time domain = {s 1,..., s M }.

ime eries imilarity ip compares dual-domain time series based on their periodic patterns that we call intersection sets. An intersection set P τ (X) of a time series X is created by a set of polygons that are derived according to [2], where time series with one time domain are represented as a sequence of intervals w.r.t. to a threshold value τ. In this case, τ corresponds to a 2D plane which intersects the 3D time series X. he polygons of an intersection set represent the evolution of amplitude-level patterns as spatial objects, as they deliver all information about the periods of time during which the amplitudes of the time series exceed τ. In the temperature example, the polygons contain only values beyond a certain temperature level. hus, the patterns emerge from temperature values that are greater than τ. However, patterns of summer and winter months are different in their occurrence and their dimensions, as depicted in Figure 1: the patterns of August 2003 are more concentrated within the days, but they are more constant over the whole month, whereas January 2001 contains patterns having a lower variance over consecutive days but that hardly change within the days. he degree of periodicity of a time series is reflected by the extent of the polygons, so even non-periodic patterns can be detected and addressed by their spatial location. he distance d τ (X, Y ) of two time series X and Y reflects the dissimilarity of the intersection sets P τ (X) and P τ (Y ) for a certain τ. his step reduces the comparison of time series to comparing the sets of polygons such that spatial similarity computation methods can apply. o save computational cost while computing the distance as well as to allow for the usage of index structures like the R -tree [3], ip derives local features (e.g. approximations or numerical values) from the polygons and global features (e.g. characteristics) from the intersection sets and computes the distance based on them (cf. Figure 1). he number of polygons and, thus, the number of local features varies for different intersection sets, so ip employs the um of Minimal Distances (MD) measure [4]. he distance based on global features is simply calculated as the L p -distance between the associated features, as, for each intersection set, ip derives the same amount of global features. For similarity calculation and further analysis, intersection sets for different threshold values can be pre-computed for a dataset and stored in an underlying database. hus, relevant feature information of intersection sets can simply be reloaded for similarity queries. Furthermore, this threshold-based method is very efficient as, contrary to simple measures like the Euclidean distance, it does not need to consider the originally highdimensional structure of time series. 3 Architecture ip has been implemented using Java and the Java3D API. he framework is able to load datasets of time series following the ARFF format 1 into the application. For each time series, the user can set several threshold planes to compute intersection sets w.r.t. these threshold values. he polygons can be approximated by simple conservative bounds like minimum bounding rectangles 1 http://weka.wiki.sourceforge.net/arff

(a) Dual-domain time series visualization. (b) 2D view. (c) iview displaying polygons and features. Fig. 2. Visual exploration of dual-domain time series with ip. (MBRs) and thus be stored efficiently in the internal database by the support of internal index structures. his also accelerates similarity queries, as multi-step query processing can be applied. For a time series X, each intersection set P τ (X) w.r.t. a specific value for the threshold τ does only have to be computed once. Once stored in the internal database, the ID of P τ (X) can simply be associated with the ID of the time series X from which was derived when X is opened by ip for another time. he user is able to perform an extensive evaluation using several techniques to mine databases of time series that are based on periodic patterns, such as k-nearest-neighbor (knn) queries with distance ranking, precision-recall analysis and knn classification. ip provides an intuitive graphical user interface. Exemplary elements are depicted in Figure 2. he core elements include two Java3D applets displaying two time series simultaneously in their dual-domain representation (cf. Fig. 2(a)). As several user-defined threshold planes can be set for each time series, ip offers two additional applets called iview to depict the polygons and the presentable features (i.e. approximations) of the corresponding intersection sets. Additionally, the user is able to split a time series along each axis. hen each subsequence of a dual-domain time series can be displayed in an external 2D chart (cf. Fig. 2(b)). 4 Demo our In this section, a short overview of the nature of the demonstration of ip is given, which will be performed applying the example of environmental research where the time series are built of normalized temperature measurements. he recorded and prepared data, labelled corresponding to the four seasons of a year, was provided by the Environmental tate Office of Bavaria, Germany 2. 2 http://www.lfu.bayern.de/

In the demo, we first present how the dual-domain time series representation can discover a higher content of information compared to the original, sequential representation. For example, depending on the season, the temperature curves show different characteristics in their course, so representative patterns can already be derived visually from this 3D surface. When setting threshold planes arbitrarily, the user can materialize those patterns. By setting several threshold planes, the user can examine how to find suitable values for τ that should be used while performing similarity queries. In general, the higher a threshold is, the less is the number of amplitude values of a time series that exceed the threshold and, thus, contribute to a polygon of the intersection set. By defining a value or a range for τ, we show how to define the relevant time series values for the queries. A high τ value, e.g. τ = 80 F, in the temperature example corresponds to the query: Given the curve of a query month showing temperature values higher than 80 F on the first days at one certain time of day, please return all months that also show values higher than 80 F on their first days at this time of day.. If the query time series contains measurements of a summer month, it is likely to get summer months returned in the result set, as winter months show a different behavior, even if the amplitudes of the dataset are normalized. As an additional feature, the system supports the automatic detection of the value for τ that yields the best results without specifying any minimum query temperature level. For that purpose, a knn classification is performed in order to find a τ value to obtain the best accuracy results. Exemplarily for the normalized temperature dataset, choosing τ = 0.6 yields an accuracy value of 0.65, where by applying the Euclidean distance we get an accuracy of 0.56. o get an overview of the quality of the results that can be achieved with different values for τ, a parameter analysis shows the classification accuracies and average precision values for every possible τ within the amplitude range of the time series. Of course the result strongly depends on the selection of the features of the intersection sets. herefore, we show that features can be combined to improve the expressiveness of the results. References 1. J. Aßfalg,. Bernecker, H.-P. Kriegel, P. Kröger, and M. Renz. Periodic pattern analysis in time series databases. In Database ystems for Advanced Applications, 14th International Conference, DAFAA 2009, Brisbane, Australia, April 21-23, 2009, pages 354 368, 2009. 2. J. Aßfalg, H.-P. Kriegel, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz. hreshold similarity queries in large time series databases. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3-8 April 2006, Atlanta, GA, UA, page 149, 2006. 3. N. Beckmann, H.-P. Kriegel, R. chneider, and B. eeger. he R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM IGMOD International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990, pages 322 331, 1990. 4.. Eiter and H. Mannila. Distance measures for point sets and their computation. Acta Informatica, 34(2):109 133, 1997.