Multiresolution Motif Discovery in Time Series

Tenth SIAM International Conference on Data Mining
Columbus, Ohio, USA
Multiresolution Motif Discovery in Time Series
Nuno Castro, Paulo Azevedo
Department of Informatics, University of Minho, Portugal
April 30th, 2010

Roadmap
  I. Motif definition
  II. Motivation
  III. Related work limitations
  IV. Our algorithm
  V. Experimental Analysis
  VI. Future work
  VII. Conclusion

I Motif Definition
Motifs, also known as recurrent patterns, frequent patterns, repeated subsequences, or typical shapes, are previously unknown patterns in time series

II Motivation
Finding motifs is an important task:
  Describe the time series at hand
  Help summarize/represent the database
  Provide useful insight to the domain expert
Examples of motifs:
  Patterns that typically precede a seizure in EEG
  DNA subsequence preserved through evolution
  Bursts in telecommunication traffic

III Related work limitations
Computational complexity
  Quadratic algorithms are clearly not the solution
Disk inefficient (use expensive random disk accesses)
Memory inefficient (assume data can fit into main memory)
Assume all data are available

III Related work limitations (cont.)
Consider motifs at a single resolution
Are not suited to interactivity
Large number of unintuitive parameters to set:
  Motif length
  Range (distance threshold)
  Number of columns in the subsequence matrix
Limited to finding motifs in univariate time series

IV Our algorithm
We propose an algorithm: Multiresolution Motif Discovery in Time Series (MrMotif)
Time efficient:
  One single sequential disk scan
  Clever representation technique (iSAX)
  Use of constant access time structures
Memory efficient:
  Combine our approach with the Space-Saving algorithm
  Adjustable amount of memory to use

IV Our algorithm: Problem definition
We follow a Top-K frequent pattern approach, i.e. finding the Top-K motifs
A time series is counted as a repetition of another if both have the same symbolic representation
We use the indexable Symbolic Aggregate approXimation (iSAX*)
* Shieh, J. and Keogh, E., iSAX: indexing and mining terabyte sized time series, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp. 623-631.

IV Our algorithm: Problem definition (iSAX)
State of the art time series representation technique
Widely used in time series data mining
Converts a time series to a sequence of symbols (word), given a resolution (alphabet size) and a word size
[Figure: image generated by MATLAB and code provided by the iSAX authors]
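The conversion described above can be sketched in a few lines. This is a minimal illustration of the usual SAX-style steps (z-normalize, average over equal-width segments, then quantize against equiprobable Gaussian breakpoints), not the iSAX authors' code. The function name `sax_word`, the numbering of symbols from 0 for the lowest band, and the requirement that the series length be a multiple of the word size are all assumptions made for this example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, word_size, resolution):
    """Convert a numeric series into a symbolic word (hypothetical helper).

    Assumes len(series) is a multiple of word_size and that symbols are
    numbered 0..resolution-1 from the lowest value band upwards.
    """
    # 1. z-normalize so that the Gaussian breakpoints are meaningful.
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # 2. Piecewise aggregate approximation: one mean per segment.
    seg = len(z) // word_size
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_size)]

    # 3. Breakpoints that split N(0, 1) into `resolution` equiprobable bands.
    standard_normal = NormalDist()
    breakpoints = [standard_normal.inv_cdf(i / resolution)
                   for i in range(1, resolution)]

    # 4. Each segment mean becomes the index of the band it falls into.
    return tuple(bisect_right(breakpoints, v) for v in paa)


rising = list(range(16))
word_res4 = sax_word(rising, word_size=4, resolution=4)
word_res2 = sax_word(rising, word_size=4, resolution=2)
```

For the steadily rising toy series, each of the four segments lands in a different band at resolution 4, while at resolution 2 the word only records whether a segment is below or above the mean.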

IV Our algorithm: Problem definition (iSAX, cont.)
Ability to easily move between different resolutions

Resolution  Decimal word                  Binary word
2           {0, 1, 1, 1, 0, 0, 0, 0}      {0,1,1,1,0,0,0,0}
4           {1, 2, 3, 2, 1, 0, 1, 1}      {01,10,11,10,01,00,01,01}
8           {2, 5, 7, 5, 3, 0, 3, 3}      {010,101,111,101,011,000,011,011}
16          {5, 11, 15, 11, 6, 1, 6, 6}   {0101,1011,1111,1011,0110,0001,0110,0110}

[Figure panels: Resolution = 4, Resolution = 16. Image generated by MATLAB and code provided by the iSAX authors]
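The table shows why moving to a coarser resolution is cheap: each lower-resolution symbol is a bit prefix of the higher-resolution one. Here is a small sketch of that step, assuming resolutions are powers of two. The function name `demote` is made up for this example.

```python
def demote(word, from_resolution, to_resolution):
    """Re-express a word at a coarser resolution by dropping low-order bits.

    Assumes both resolutions are powers of two and
    to_resolution <= from_resolution (hypothetical helper).
    """
    shift = from_resolution.bit_length() - to_resolution.bit_length()
    if shift < 0:
        raise ValueError("can only move to an equal or coarser resolution")
    return [symbol >> shift for symbol in word]


word_16 = [5, 11, 15, 11, 6, 1, 6, 6]   # resolution 16 row of the table
word_8 = demote(word_16, 16, 8)
word_4 = demote(word_16, 16, 4)
word_2 = demote(word_16, 16, 2)
```

Applied to the resolution 16 word, the shifts reproduce the other three rows of the table. Going the other way, to a finer resolution, is not possible from the word alone because the dropped bits have to be recomputed from the data.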

IV Our algorithm: Problem definition (cont.)
Example of 3 time series that form a motif
Our motif is the word: {1, 1, 3, 8, 11, 12, 13, 13}

IV Our algorithm: MrMotif
Perform one traversal of the time series database
For each resolution:
  Convert each time series to an iSAX word
  Maintain and update a counter of the current Top-K motifs, indexed by iSAX word
e.g. resolution 2:
  Motif               Count
  {2,5,7,5,3,0,3,3}   54
  {4,7,0,0,0,1,5,5}   32
  {0,0,0,4,5,2,0,0}   25
  ...
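The single-pass counting loop can be sketched as follows. This is an illustration of the idea rather than the authors' implementation. The function `mr_motif` and its `to_word` parameter are hypothetical, and the sketch takes one shortcut that the slide does not state: it converts each series once at the highest resolution and derives the coarser words by bit shifting, which assumes power-of-two resolutions.

```python
from collections import Counter


def mr_motif(database, to_word, resolutions, k):
    """Count symbolic words at several resolutions in one pass.

    database    -- iterable of time series, read once, in order
    to_word     -- hypothetical converter: (series, resolution) -> symbols
    resolutions -- powers of two, e.g. (4, 8, 16)
    k           -- number of top motifs to report per resolution
    """
    top = max(resolutions)
    counters = {r: Counter() for r in resolutions}
    for series in database:                      # one sequential scan
        word = to_word(series, top)
        for r in resolutions:
            shift = top.bit_length() - r.bit_length()
            coarse = tuple(symbol >> shift for symbol in word)
            counters[r][coarse] += 1             # hash-indexed update
    return {r: counter.most_common(k) for r, counter in counters.items()}


# Toy input: the "series" are already resolution 16 symbol sequences,
# so the converter simply passes them through.
toy_db = [[5, 11, 15, 11], [5, 11, 15, 11], [4, 10, 14, 10], [0, 0, 0, 0]]
top_motifs = mr_motif(toy_db, lambda s, r: tuple(s), (4, 8, 16), k=1)
```

In the toy run, two series match exactly at resolution 16, and a third joins them at resolutions 8 and 4. This unbounded version corresponds to running without any memory limit on the counters.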

IV Our algorithm: Properties
Multiresolution
Interactivity
Space-Saving

IV Our algorithm: Properties (Multiresolution)
Our intuition is that at larger resolutions it is harder for two different time series to match
Each interval narrows considerably each time we double the resolution

IV Our algorithm: Properties (Multiresolution, cont.)
At the largest resolutions, we are working closer to the level of raw data
This assumption saves us from performing expensive distance calculations
The multiresolution capability makes it possible to develop interactive visual tools

IV Our algorithm: Properties (Interactivity)
Feed a tree-like structure with our motifs at different resolutions
This allows the user to navigate the motif hierarchy
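One way to picture such a tree is to link each motif to the motif one resolution below it that shares its bit prefix. The slides do not say how the structure is built, so this is only a sketch. The names `parent_word` and `build_hierarchy`, and the input layout `{resolution: {word: count}}`, are assumptions.

```python
from collections import defaultdict


def parent_word(word):
    """Word at half the resolution: drop the lowest bit of every symbol."""
    return tuple(symbol >> 1 for symbol in word)


def build_hierarchy(motifs_by_resolution):
    """Link each motif to its parent at the next coarser resolution.

    motifs_by_resolution -- {resolution: {word: count}} (assumed layout)
    Returns {(resolution, word): [(child_resolution, child_word), ...]}.
    """
    children = defaultdict(list)
    for resolution, motifs in motifs_by_resolution.items():
        coarser = resolution // 2
        if coarser not in motifs_by_resolution:
            continue                      # no coarser level to attach to
        for word in motifs:
            parent = parent_word(word)
            if parent in motifs_by_resolution[coarser]:
                children[(coarser, parent)].append((resolution, word))
    return dict(children)


toy_motifs = {
    4: {(1, 2, 3, 2): 5},
    8: {(2, 5, 7, 5): 3, (3, 4, 6, 4): 2, (0, 0, 0, 0): 1},
}
tree = build_hierarchy(toy_motifs)
```

A visual tool could then start from the coarse motifs and expand a node to show the finer motifs it splits into.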

IV Our algorithm: Properties (Space-Saving, SS)
Proposed* to efficiently compute frequent elements in data streams
Monitor only m words
For each new word e:
  If e is already monitored, increment its count
  If not, replace the least frequent monitored element with e, which inherits that element's count, and increment it
Experimentally shown to guarantee very small errors, with known upper bounds on the over-estimation errors
Reference***
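The update rule above can be sketched as a small class. This is a simplified illustration: it finds the least frequent element with a linear scan, whereas a real implementation would use a structure with constant-time updates. The class name `SpaceSaving`, the method names, and the filling of free slots before any replacement happens are choices made for this example.

```python
class SpaceSaving:
    """Approximate frequency counter that monitors at most m elements."""

    def __init__(self, m):
        self.m = m
        self.counts = {}   # element -> estimated count
        self.errors = {}   # element -> maximum possible over-estimation

    def offer(self, element):
        if element in self.counts:
            self.counts[element] += 1
        elif len(self.counts) < self.m:
            # Free slot available: start an exact count.
            self.counts[element] = 1
            self.errors[element] = 0
        else:
            # Evict the least frequent element; the newcomer inherits
            # its count, which bounds how much it may be over-estimated.
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[element] = inherited + 1
            self.errors[element] = inherited

    def top_k(self, k):
        ranked = sorted(self.counts.items(), key=lambda item: -item[1])
        return ranked[:k]


summary = SpaceSaving(m=2)
for word in ["a", "a", "b", "c"]:
    summary.offer(word)
```

After the toy stream, "c" has replaced "b" and carries a count of 2 with a recorded error of 1, so its true count lies between 1 and 2.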

IV Our algorithm: Properties (Space-Saving, cont.)
We start MrMotif with Space-Saving disabled, in order to make m large enough to further reduce errors
Activate Space-Saving when the memory threshold is reached (e.g. 128MB guarantees m = 10000 elements) or memory is about to run out

V Experimental Analysis
Scalability experiments (synthetic data):
  Execution time
  Memory
Experiments with noise
Real applications

V Experimental Analysis: Scalability Experiments
Dataset: reproduced from Mueen et al., 2009*
  10 different sets of random walk time series
  Each set with 10000 up to 100000 series of length 1024
  About 8GB of time series data
We compare MrMotif to Random Projection (Chiu et al., 2003):
  Due to its popularity
  It is the basis of many current motif discovery approaches
We also compare the Space-Saving (SS) and Full Memory (FM) versions of MrMotif
**Ref

V Experimental Analysis: Scalability Experiments (Execution time)
Algorithms are executed 10 times for each of the ten increasingly larger datasets
Execution times for each dataset are averaged
Top-10 motifs are recorded
Maximum amount of memory set to 128MB

V Experimental Analysis: Scalability Experiments (Execution time, results)

DB size   MrMotif (SS)   MrMotif (FM)   Random Projection
10000     16.43          13.91          53.54
20000     32.68          26.85          193.88
30000     49.60          40.34          404.41
40000     62.92          51.87          705.02
50000     79.26          66.13          1221.13
60000     98.15          78.44          1613.53
70000     114.35         89.33          2139.20
80000     127.27         106.40         2708.53
90000     149.40         116.08         3468.50
100000    158.76         133.11         4357.39
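A quick calculation on the first and last rows makes the scaling behaviour explicit. The numbers are copied from the table. The variable names are made up for this example, and no time unit is assumed because the slide does not give one.

```python
# (MrMotif SS, MrMotif FM, Random Projection) execution times from the table
times = {
    10000: (16.43, 13.91, 53.54),
    100000: (158.76, 133.11, 4357.39),
}

small, large = times[10000], times[100000]

# How much slower each method gets when the database grows 10 times.
growth_ss = large[0] / small[0]
growth_fm = large[1] / small[1]
growth_rp = large[2] / small[2]

# How much faster MrMotif (SS) is than Random Projection at each size.
speedup_small = small[2] / small[0]
speedup_large = large[2] / large[0]

print(f"growth for 10x data: SS {growth_ss:.1f}x, FM {growth_fm:.1f}x, "
      f"RP {growth_rp:.1f}x")
print(f"speedup of SS over RP: {speedup_small:.1f}x at 10000, "
      f"{speedup_large:.1f}x at 100000")
```

Both MrMotif versions slow down by a little under 10 times for 10 times more data, which is consistent with a single linear scan. Random Projection slows down by roughly 80 times, much closer to quadratic growth.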

V Experimental Analysis: Scalability Experiments (Memory)
We compare the memory usage of the FM and SS versions of MrMotif on the 100000 sized dataset
Observe the impact of SS (memory limit set to 128MB)

V Experimental Analysis: Experiments with noise
We apply MrMotif to the 10000 sized dataset and record the Top-10 patterns for resolution 4
MrMotif is then executed on each noisy variation of the series
Precision and recall are calculated with respect to the results on the original series
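Assuming the comparison is made between sets of motif words, precision and recall can be computed as below. The slides do not spell out the exact procedure, so this is a sketch using the standard set-based definitions. The function name `precision_recall` and the toy words are invented for illustration.

```python
def precision_recall(found, reference):
    """Set-based precision and recall of `found` against `reference`.

    found     -- motif words reported on the noisy data
    reference -- motif words reported on the original data
    """
    found, reference = set(found), set(reference)
    hits = len(found & reference)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy example: 4 reference motifs, 3 reported on noisy data, 2 in common.
original_top = [(1, 2, 3, 2), (0, 0, 3, 3), (3, 3, 0, 0), (2, 1, 1, 2)]
noisy_top = [(1, 2, 3, 2), (0, 0, 3, 3), (1, 1, 2, 2)]
precision, recall = precision_recall(noisy_top, original_top)
```

When both runs report exactly K distinct motifs, the two measures coincide, since the number of hits is divided by K in both cases.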

V Experimental Analysis: Real applications
We have applied MrMotif to real data from:
  Protein unfolding
  Sensor networks monitoring
  Telecommunication network operator

VI Conclusions
We have introduced MrMotif to find motifs in time series:
  Fast
  Space-efficient
  Intuitive
  Robust to noise
  Easy to use
  Straightforward
  Reproducible

VII Future work
Motif evaluation and significance measures:
  Motifs are typically evaluated in a subjective way by humans
  Objective evaluation measures that rank motifs in terms of significance are necessary
Motifs as building blocks:
  As motifs can be used to describe the time series, they can be used as building blocks for other data mining tasks:
    Classification
    Abnormality detection
    Forecasting

Thank you for your attention!
Contact: castro@di.uminho.pt
MrMotif Web site (executable, source code and datasets): www.di.uminho.pt/~castro/mrmotif
