Multiresolution Motif Discovery in Time Series

Tenth SIAM International Conference on Data Mining, Columbus, Ohio, USA. Multiresolution Motif Discovery in Time Series. Nuno Castro, Paulo Azevedo. Department of Informatics, University of Minho, Portugal. April 30th, 2010

Roadmap I. Motif definition II. Motivation III. Related work limitations IV. Our algorithm V. Experimental Analysis VI. Future work VII. Conclusion

I Motif Definition Motifs, also known as recurrent patterns, frequent patterns, repeated subsequences, or typical shapes, are previously unknown patterns in time series.

II Motivation Finding motifs is an important task: describe the time series at hand; help summarize/represent the database; provide useful insight to the domain expert. Examples of motifs: patterns that typically precede a seizure in EEG; a DNA subsequence preserved through evolution; bursts in telecommunication traffic.

III Related work limitations Computational complexity: quadratic algorithms are clearly not the solution. Disk inefficient (use expensive random disk accesses). Memory inefficient (assume data can fit into main memory). Assume all data are available.

III Related work limitations (cont.) Consider motifs at a single resolution. Are not suited to interactivity. Large number of unintuitive parameters to set: motif length; range (distance threshold); number of columns in the subsequence matrix. Limited to finding motifs in univariate time series.

IV Our algorithm We propose an algorithm, Multiresolution Motif Discovery in Time Series (MrMotif). Time efficient: one single sequential disk scan; clever representation technique (iSAX); use of constant access time structures. Memory efficient: combine our approach with the Space-Saving algorithm; adjustable amount of memory to use.

IV Our algorithm Problem definition We follow a Top-K frequent pattern approach, i.e. finding the Top-K motifs. A time series can be counted as a repetition of another if they have the same symbolic representation. We use the indexable Symbolic Aggregate approXimation (iSAX*). * Shieh, J. and Keogh, E., iSAX: indexing and mining terabyte sized time series, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp. 623-631.

IV Our algorithm Problem definition iSAX State of the art time series representation technique. Widely used in time series data mining. Converts a time series to a sequence of symbols (a word), given a resolution (alphabet size) and a word size. Image generated by MATLAB and code provided by iSAX authors
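The conversion described on this slide can be sketched as follows. This is a minimal illustration, not the iSAX authors' code: it assumes the usual SAX recipe (z-normalization, averaging into word_size segments, then discretization with breakpoints that split a standard normal distribution into equiprobable intervals). The function names, the choice of symbol 0 for the lowest interval, and the requirement that the series length be a multiple of the word size are assumptions made for the example.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def breakpoints(resolution):
    """Cut points splitting N(0, 1) into `resolution` equiprobable intervals."""
    nd = NormalDist()
    return [nd.inv_cdf(i / resolution) for i in range(1, resolution)]


def isax_word(series, word_size, resolution):
    """Hypothetical helper: series -> tuple of `word_size` integer symbols."""
    mu, sigma = fmean(series), pstdev(series)
    # z-normalize; a constant series maps to all zeros (assumption)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    seg = len(z) // word_size  # assumes len(series) is a multiple of word_size
    # piecewise aggregate approximation: one mean per segment
    paa = [fmean(z[i * seg:(i + 1) * seg]) for i in range(word_size)]
    cuts = breakpoints(resolution)
    # symbol = number of breakpoints at or below the segment mean (0 = lowest)
    return tuple(bisect_right(cuts, v) for v in paa)


ramp = list(range(16))
print(isax_word(ramp, word_size=4, resolution=4))
print(isax_word(ramp, word_size=4, resolution=2))
```

On a steadily increasing series the symbols increase from the first segment to the last, which is a quick sanity check for the symbol ordering.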

IV Our algorithm Problem definition iSAX (cont.) Ability to easily move between different resolutions
Resolution   Decimal word                     Binary word
2            {0, 1, 1, 1, 0, 0, 0, 0}         {0,1,1,1,0,0,0,0}
4            {1, 2, 3, 2, 1, 0, 1, 1}         {01,10,11,10,01,00,01,01}
8            {2, 5, 7, 5, 3, 0, 3, 3}         {010,101,111,101,011,000,011,011}
16           {5, 11, 15, 11, 6, 1, 6, 6}      {0101,1011,1111,1011,0110,0001,0110,0110}
[Figure panels: Resolution = 4; Resolution = 16. Image generated by MATLAB and code provided by iSAX authors]
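The table suggests why moving between resolutions is cheap: for power-of-two resolutions, each row is the row below it with the lowest bit of every symbol dropped. A small sketch of that observation follows; the function name change_resolution is an assumption made for the example, and it only moves toward coarser resolutions, since going to a finer one needs the original data.

```python
def change_resolution(word, from_res, to_res):
    """Demote a word to a coarser power-of-two resolution by dropping low bits."""
    shift = from_res.bit_length() - to_res.bit_length()
    if shift < 0:
        raise ValueError("a finer resolution cannot be derived from the word alone")
    return tuple(symbol >> shift for symbol in word)


word16 = (5, 11, 15, 11, 6, 1, 6, 6)  # resolution-16 row of the table
for res in (8, 4, 2):
    print(res, change_resolution(word16, 16, res))
```

Running the loop reproduces the resolution 8, 4 and 2 rows of the table from the resolution 16 row.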

IV Our algorithm Problem definition (cont.) Example of 3 time series that form a motif Our motif is the word: { 1, 1, 3, 8, 11, 12, 13, 13 }

IV Our algorithm MrMotif Perform one traversal of the time series database. For each resolution: convert each time series to an iSAX word; maintain and update a counter of the current Top-K motifs, indexed by iSAX word, e.g. resolution 2
Motif                 Count
{2,5,7,5,3,0,3,3}     54
{4,7,0,0,0,1,5,5}     32
{0,0,0,4,5,2,0,0}     25
...
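The single-pass counting loop can be sketched as below. This is an illustration under stated assumptions, not the released MrMotif code: the input is taken to be words already computed at the finest resolution, coarser words are derived by dropping low bits rather than by re-converting each series, and the resolution set (2, 4, 8, 16) is borrowed from the earlier table. The function name and parameters are hypothetical.

```python
from collections import Counter


def mrmotif_sketch(finest_words, finest_res=16, resolutions=(2, 4, 8, 16), k=3):
    """One pass over the data; one counter of words per resolution."""
    counters = {res: Counter() for res in resolutions}
    for word in finest_words:  # single sequential traversal
        for res in resolutions:
            shift = finest_res.bit_length() - res.bit_length()
            coarse = tuple(symbol >> shift for symbol in word)
            counters[res][coarse] += 1  # hash lookup: constant expected time
    return {res: counter.most_common(k) for res, counter in counters.items()}


words = [
    (5, 11, 15, 11, 6, 1, 6, 6),
    (4, 10, 14, 10, 7, 0, 7, 7),
    (5, 11, 15, 11, 6, 1, 6, 6),
    (15, 0, 15, 0, 15, 0, 15, 0),
]
for res, top in mrmotif_sketch(words).items():
    print(res, top)
```

In this toy input the first two words differ at resolution 16 but collapse to the same word at resolution 8, so the same motif gathers more repetitions at the coarser level.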

IV Our algorithm Properties Multiresolution Interactivity Space-Saving

IV Our algorithm Properties Multiresolution Our intuition is that at larger resolutions it is harder for two different time series to match: each interval narrows considerably each time we double the resolution.

IV Our algorithm Properties Multiresolution (cont.) At the largest resolutions, we are working closer to the level of raw data. This assumption spares us from performing expensive distance calculations. The multiresolution capability allows the development of interactive visual tools.

IV Our algorithm Properties Interactivity Feed a tree-like structure with our motifs at different resolutions. This allows navigation of the motif hierarchy.
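One possible shape for such a structure is sketched below, assuming power-of-two resolutions so that the parent of a motif is the same word with the lowest bit of each symbol dropped. The function name, the dictionary layout, and the sample motifs are illustrative assumptions; the slides do not specify the structure used by the tool.

```python
def build_motif_tree(motifs_by_res):
    """Map (resolution, word) -> list of child (resolution, word) pairs."""
    resolutions = sorted(motifs_by_res)
    children = {}
    for coarse, fine in zip(resolutions, resolutions[1:]):
        shift = fine.bit_length() - coarse.bit_length()
        for word in motifs_by_res[fine]:
            parent = tuple(symbol >> shift for symbol in word)
            children.setdefault((coarse, parent), []).append((fine, word))
    return children


# Hypothetical motifs found at three resolutions (word size 4 for brevity).
motifs = {
    4: [(1, 2, 3, 2)],
    8: [(2, 5, 7, 5), (3, 4, 6, 5)],
    16: [(5, 11, 15, 11)],
}
tree = build_motif_tree(motifs)
for node, kids in tree.items():
    print(node, "->", kids)
```

Expanding a node then amounts to a dictionary lookup, which suits drilling down from a coarse motif to its finer variants.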

IV Our algorithm Properties Space-Saving (SS) Proposed to efficiently compute frequent elements in data streams. Monitor only m words. For each new word e: if e is already monitored, increment its count; if not, replace the least frequent monitored element with e (which inherits that element's count) and increment it. Experimentally shown to yield very small errors, with known upper bounds on the over-estimation errors.
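The update rule above can be sketched in a few lines. This is a simplified illustration: it finds the least frequent element with a linear scan, whereas an efficient implementation would keep the monitored elements in a structure that exposes the minimum in constant time. The function name and the separate errors dictionary are assumptions made for the example.

```python
def space_saving(stream, m):
    """Track at most m elements; return (estimated counts, over-estimation bounds)."""
    counts = {}  # monitored element -> estimated count
    errors = {}  # monitored element -> upper bound on over-estimation
    for e in stream:
        if e in counts:
            counts[e] += 1
        elif len(counts) < m:
            counts[e] = 1
            errors[e] = 0
        else:
            victim = min(counts, key=counts.get)  # least frequent monitored element
            inherited = counts.pop(victim)
            errors.pop(victim)
            counts[e] = inherited + 1  # e inherits the victim's count, then increments
            errors[e] = inherited      # e may be over-counted by at most this much
    return counts, errors


counts, errors = space_saving(["a", "a", "b", "a", "c", "b", "d", "a"], m=3)
print(counts, errors)
```

Each estimate is never below the true frequency, and the estimate minus the recorded error bound is never above it, which is what makes the over-estimation bounded.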

IV Our algorithm Properties Space-Saving (cont.) We start MrMotif with Space-Saving disabled, in order to make m large enough to further reduce errors. Activate Space-Saving when the memory threshold is reached (e.g. 128MB guarantees m = 10000 elements) or memory is about to run out.

V Experimental Analysis Scalability experiments (synthetic data) Execution time Memory Experiments with noise Real applications

V Experimental Analysis Scalability Experiments Dataset: reproduced from Mueen et al., 2009. 10 different sets of random walk time series, each set with 10000 up to 100000 series of length 1024; about 8GB of time series data. We compare MrMotif to Random Projection (Chiu et al., 2003) due to its popularity and because it is the basis of many current motif discovery approaches. We also compare the Space-Saving (SS) and Full Memory (FM) versions of MrMotif.

V Experimental Analysis Scalability Experiments Execution time Algorithms are executed 10 times for each of the ten increasingly larger datasets Execution times for each dataset are averaged Top-10 motifs are recorded Maximum amount of memory set to 128MB

V Experimental Analysis Scalability Experiments Execution time (results)
DB size   MrMotif (SS)   MrMotif (FM)   Random Projection
10000     16.43          13.91          53.54
20000     32.68          26.85          193.88
30000     49.60          40.34          404.41
40000     62.92          51.87          705.02
50000     79.26          66.13          1221.13
60000     98.15          78.44          1613.53
70000     114.35         89.33          2139.20
80000     127.27         106.40         2708.53
90000     149.40         116.08         3468.50
100000    158.76         133.11         4357.39

V Experimental Analysis Scalability Experiments Memory We compare the memory usage of the FM and SS versions of MrMotif on the dataset with 100000 series. Observe the impact of SS (memory limit set to 128MB).

V Experimental Analysis Experiments with noise We apply MrMotif to the dataset with 10000 series and record the Top-10 patterns for resolution 4. MrMotif is then executed on each noisy variation of the series. Precision and recall with respect to the original series are calculated.
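The slides do not spell out how precision and recall are computed, so the following is only one plausible reading: treat the Top-K words found on the original series as the reference set and the Top-K words found on a noisy variation as the retrieved set. The function name and the sample words are hypothetical.

```python
def precision_recall(reference_top_k, noisy_top_k):
    """Set-based precision and recall of the noisy Top-K against the reference Top-K."""
    reference, retrieved = set(reference_top_k), set(noisy_top_k)
    hits = len(reference & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical motif words (word size 2 for brevity).
reference = [(1, 2), (3, 0), (2, 2)]
noisy = [(1, 2), (2, 2), (0, 0), (3, 3)]
print(precision_recall(reference, noisy))
```

When both runs report exactly K motifs, the two sets have the same size and precision equals recall under this definition.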

V Experimental Analysis Experiments with noise (cont.)

V Experimental Analysis Real applications We have applied MrMotif to real data from: Protein unfolding Sensor networks monitoring Telecommunication network operator

VI Conclusions We have introduced MrMotif to find motifs in time series: Fast Space-efficient Intuitive Robust to noise Easy to use Straightforward Reproducible

VII Future work Motif evaluation and significance measures: Motifs are typically evaluated in a subjective way by humans Objective evaluation measures that rank motifs in terms of significance are necessary Motifs as building blocks: As motifs can be used to describe the time series, they can be used as building blocks for other data mining tasks: Classification Abnormality detection Forecasting

Thank you for your attention! Contact: castro@di.uminho.pt MrMotif Web site (executable, source code and datasets): www.di.uminho.pt/~castro/mrmotif

On similarity and multiresolution

On similarity