Multiresolution Motif Discovery in Time Series


1 Tenth SIAM International Conference on Data Mining Columbus, Ohio, USA Multiresolution Motif Discovery in Time Series NUNO CASTRO PAULO AZEVEDO Department of Informatics University of Minho Portugal April 30th, 2010

2 Roadmap I. Motif definition II. Motivation III. Related work limitations IV. Our algorithm V. Experimental Analysis VI. Future work VII. Conclusion

3 I Motif Definition Motifs, also known as recurrent patterns, frequent patterns, repeated subsequences, or typical shapes, are previously unknown patterns in time series

4 II Motivation Finding motifs is an important task: Describe the time series at hand Help summarize/represent the database Provide useful insight to the domain expert Examples of motifs: Patterns that typically precede a seizure in EEG DNA subsequence preserved through evolution Bursts in telecommunication traffic

5 III Related work limitations Computational complexity Quadratic algorithms are clearly not the solution Disk inefficient (use expensive random disk accesses) Memory inefficient (assume data can fit into main memory) Assume all data are available

6 III Related work limitations (cont.) Consider motifs at a single resolution Are not suited to interactivity Large number of unintuitive parameters to set: Motif length Range (distance threshold) Number of columns in the subsequence matrix Limited to finding motifs in univariate time series

7 IV Our algorithm We propose an algorithm: Multiresolution Motif Discovery in Time Series: MrMotif Time efficient: One single sequential disk scan Clever representation technique (iSAX) Use of constant access time structures Memory efficient: Combine our approach with the Space-Saving algorithm Adjustable amount of memory to use

8 IV Our algorithm Problem definition We follow a Top-K frequent pattern approach, i.e. finding the Top-K motifs A time series can be counted as a repetition of another if they have the same symbolic representation We use the indexable Symbolic Aggregate approXimation (iSAX*) * Shieh, J. and Keogh, E., iSAX: indexing and mining terabyte sized time series, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp

9 IV Our algorithm Problem definition iSAX State-of-the-art time series representation technique Widely used in time series data mining Converts a time series to a sequence of symbols (word), given a resolution (alphabet size) and word size Image generated by MATLAB and code provided by iSAX authors
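The conversion described on this slide can be sketched as follows. This is a minimal illustration of a SAX-style conversion, not the authors' code: it assumes z-normalisation, piecewise averaging over equal-length segments, and breakpoints taken from the quantiles of a standard normal distribution. The function name `sax_word` and its parameters are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def sax_word(series, word_size, resolution):
    """Return a list of word_size integer symbols in [0, resolution).

    Assumptions (not from the slides): z-normalisation, equal-length
    segments, and equiprobable Gaussian breakpoints.
    """
    mu, sigma = mean(series), pstdev(series)
    # z-normalise; a constant series is mapped to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Piecewise averaging: one mean per segment (a tail that does not
    # fill a whole segment is ignored in this sketch)
    seg = len(z) // word_size
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_size)]
    # resolution - 1 breakpoints split the real line into equiprobable intervals
    cuts = [NormalDist().inv_cdf(k / resolution) for k in range(1, resolution)]
    # A symbol is the number of breakpoints lying strictly below the segment mean
    return [sum(v > c for c in cuts) for v in paa]

print(sax_word([0, 1, 2, 3, 4, 5, 6, 7], word_size=4, resolution=4))
```

Because the breakpoints for resolution r are a subset of those for resolution 2r, a word computed this way at a coarse resolution agrees with the finer word after dropping the lowest bit of each symbol.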

10 IV Our algorithm Problem definition iSAX (cont.) Ability to easily move between different resolutions

Resolution  Decimal word                   Binary word
2           {0, 1, 1, 1, 0, 0, 0, 0}       {0,1,1,1,0,0,0,0}
4           {1, 2, 3, 2, 1, 0, 1, 1}       {01,10,11,10,01,00,01,01}
8           {2, 5, 7, 5, 3, 0, 3, 3}       {010,101,111,101,011,000,011,011}
16          {5, 11, 15, 11, 6, 1, 6, 6}    {0101,1011,1111,1011,0110,0001,0110,0110}

Figure panels: Resolution = 4, Resolution = 16. Image generated by MATLAB and code provided by iSAX authors
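The table on this slide suggests that a coarser word is obtained by dropping the low-order bits of each symbol. A small sketch of that operation, assuming resolutions are powers of two; the helper names `demote` and `to_binary` are hypothetical:

```python
def demote(word, from_res, to_res):
    """Lower the resolution of a word by dropping low-order bits.

    Assumes from_res and to_res are powers of two with to_res <= from_res.
    """
    shift = (from_res // to_res).bit_length() - 1
    return [s >> shift for s in word]

def to_binary(word, resolution):
    """Render each symbol with log2(resolution) bits."""
    bits = resolution.bit_length() - 1
    return [format(s, f"0{bits}b") for s in word]

word16 = [5, 11, 15, 11, 6, 1, 6, 6]
for res in (16, 8, 4, 2):
    w = demote(word16, 16, res)
    print(res, w, to_binary(w, res))
```

Starting from the resolution-16 word, the loop reproduces the four rows of the table.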

11 IV Our algorithm Problem definition (cont.) Example of 3 time series that form a motif Our motif is the word: { 1, 1, 3, 8, 11, 12, 13, 13 }

12 IV Our algorithm MrMotif Perform one traversal of the time series database For each resolution: Convert each time series to an iSAX word Maintain and update a counter of the current Top-K motifs, indexed by iSAX word e.g. resolution 8

Motif               Count
{2,5,7,5,3,0,3,3}   54
{4,7,0,0,0,1,5,5}   32
{0,0,0,4,5,2,0,0}   25
...
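The single-pass counting step can be sketched as below. This is an illustration under assumptions, not the MrMotif implementation: it takes words already computed at the highest resolution, derives the coarser words by dropping low-order bits, and keeps every counter in memory (no Space-Saving). The function name `count_motifs` and its parameters are hypothetical.

```python
from collections import Counter

def count_motifs(words, max_res, resolutions, k):
    """One pass over highest-resolution words; Top-K words per resolution.

    words: iterable of tuples of integer symbols at resolution max_res.
    resolutions: powers of two, each <= max_res (assumption of this sketch).
    """
    counters = {r: Counter() for r in resolutions}
    for word in words:                      # single traversal of the data
        for r in resolutions:
            shift = (max_res // r).bit_length() - 1
            counters[r][tuple(s >> shift for s in word)] += 1
    return {r: c.most_common(k) for r, c in counters.items()}

words = [(5, 11), (4, 10), (5, 11), (12, 3)]
print(count_motifs(words, max_res=16, resolutions=(4, 16), k=1))
```

In the toy input, (5, 11) and (4, 10) are distinct at resolution 16 but fall into the same word at resolution 4, so the coarse counter groups three series while the fine counter groups two.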

13 IV Our algorithm Properties Multiresolution Interactivity Space-Saving

14 IV Our algorithm Properties Multiresolution Our intuition is that at the larger resolutions, it is harder for two different time series to match Each interval narrows considerably each time we double the resolution
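The narrowing of intervals can be checked numerically. The sketch below assumes the usual SAX choice of breakpoints at equiprobable quantiles of a standard normal distribution, which the slides do not spell out; `breakpoints` and `upper_central_width` are hypothetical helper names.

```python
from statistics import NormalDist

def breakpoints(resolution):
    """Equiprobable Gaussian breakpoints (an assumption of this sketch)."""
    return [NormalDist().inv_cdf(k / resolution) for k in range(1, resolution)]

def upper_central_width(resolution):
    """Width of the interval just above zero; needs resolution >= 4."""
    cuts = breakpoints(resolution)
    mid = len(cuts) // 2          # index of the breakpoint at zero
    return cuts[mid + 1] - cuts[mid]

for res in (4, 8, 16, 32):
    print(res, round(upper_central_width(res), 4))
```

Each doubling keeps the existing breakpoints and inserts one new breakpoint inside every interval, so two series that share a word at a fine resolution must have segment means in the same narrow band.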

15 IV Our algorithm Properties Multiresolution (cont.) At the largest resolutions, we are working closer to the level of raw data This assumption lets us avoid expensive distance calculations The multiresolution capability makes it possible to develop interactive visual tools

16 IV Our algorithm Properties Interactivity Feed a tree-like structure with our motifs at different resolutions This makes it possible to navigate the motif hierarchy
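One way such a hierarchy could be built is sketched below. The slides do not describe the tree's layout, so the parent-child rule used here (a motif's parent is the same word with one or more low-order bits dropped) is an assumption, and `build_hierarchy` is a hypothetical name.

```python
def build_hierarchy(words, max_res, resolutions):
    """Map each (resolution, word) node to its children at the next finer level.

    Assumption of this sketch: the parent of a word is that word demoted
    to the next coarser resolution by dropping low-order bits.
    """
    children = {}
    for word in words:
        finer = None
        for r in sorted(resolutions, reverse=True):   # fine to coarse
            shift = (max_res // r).bit_length() - 1
            node = (r, tuple(s >> shift for s in word))
            children.setdefault(node, set())
            if finer is not None:
                children[node].add(finer)
            finer = node
    return children

tree = build_hierarchy([(5, 11), (4, 10), (12, 3)], 16, (4, 16))
for node in sorted(tree):
    print(node, sorted(tree[node]))
```

A visual tool could start at the coarsest nodes and expand a node into its children when the user asks for more detail.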

17 IV Our algorithm Properties Space-Saving (SS) Proposed* to efficiently compute frequent elements in data streams
Monitor only m words
For each new word e:
  If e is already monitored, increment its count
  If not, replace the least frequent monitored element with e, which inherits that element's count, and increment it
Experimentally shown to guarantee very small errors, with known upper bounds on the over-estimation errors
Reference***
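The update rule above can be sketched in a few lines. This is a simplified illustration that finds the least frequent element with a linear scan rather than the constant-time structure a production version would use; the function name `space_saving` is hypothetical.

```python
def space_saving(stream, m):
    """Track at most m elements; return (estimated counts, error bounds).

    For a monitored element e, the true count lies between
    counts[e] - errors[e] and counts[e].
    """
    counts, errors = {}, {}
    for e in stream:
        if e in counts:
            counts[e] += 1
        elif len(counts) < m:              # free slot: start monitoring e
            counts[e], errors[e] = 1, 0
        else:
            # Evict the least frequent element; e inherits its count
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[e], errors[e] = floor + 1, floor
    return counts, errors

counts, errors = space_saving("aabaca", m=2)
print(counts, errors)
```

In the toy stream, `c` replaces `b` and inherits its count of 1, so its estimate of 2 over-counts by at most 1, which is exactly what the error bound records.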

18 IV Our algorithm Properties Space-Saving (cont.) We start MrMotif with Space-Saving disabled, in order to make m large enough to further reduce errors. Activate Space-Saving when the memory threshold is reached (e.g. 128MB guarantees m = 10000 elements) or memory is about to run out

19 V Experimental Analysis Scalability experiments (synthetic data) Execution time Memory Experiments with noise Real applications

20 V Experimental Analysis Scalability Experiments Dataset: Reproduced from Mueen et al., 2009*. 10 different sets of random walk time series Each set with up to series of length 1024 About 8GB of time series data We compare MrMotif to Random Projection (Chiu et al., 2003) Due to its popularity Is the basis of many current motif discovery approaches We also compare Space-Saving (SS) and Full Memory (FM) versions of MrMotif **Ref

21 V Experimental Analysis Scalability Experiments Execution time Algorithms are executed 10 times for each of the ten increasingly large datasets Execution times for each dataset are averaged Top-10 motifs are recorded Maximum amount of memory set to 128MB

22 V Experimental Analysis Scalability Experiments Execution time (results)

DB size   MrMotif (SS)   MrMotif (FM)   Random Projection
          ,43            13,91          53,
          ,68            26,85          193,
          ,60            40,34          404,
          ,92            51,87          705,
          ,26            66,            ,
          ,15            78,            ,
          ,35            89,            ,
          ,27            106,           ,
          ,40            116,           ,
          ,76            133,           ,39

23 V Experimental Analysis Scalability Experiments Memory We compare memory usage of the FM and SS versions of MrMotif in the sized dataset Observe the impact of SS (memory limit set to 128MB)

24 V Experimental Analysis Experiments with noise We apply MrMotif to the sized dataset and record the Top-10 patterns for resolution 4 MrMotif is executed in each variation of the series Precision/recall with respect to the original series are calculated
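The precision/recall computation can be sketched as a comparison of two sets of words. The slides do not give the exact formula used, so the standard set-based definitions are assumed here, and `precision_recall` is a hypothetical name.

```python
def precision_recall(reference, retrieved):
    """Set-based precision and recall of retrieved motifs against a reference.

    reference: Top-K words found in the original series.
    retrieved: Top-K words found in a noisy variation.
    """
    reference, retrieved = set(reference), set(retrieved)
    hits = len(reference & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

original = [(1, 2), (3, 0), (2, 2)]
noisy = [(1, 2), (2, 2), (0, 0), (3, 3)]
print(precision_recall(original, noisy))
```

When both sets contain exactly K distinct words, as with two Top-10 lists, precision and recall coincide under these definitions.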

25 V Experimental Analysis Experiments with noise (cont.)

26 V Experimental Analysis Real applications We have applied MrMotif to real data from: Protein unfolding Sensor networks monitoring Telecommunication network operator

27 VI Conclusions We have introduced MrMotif to find motifs in time series: Fast Space-efficient Intuitive Robust to noise Easy to use Straightforward Reproducible

28 VII Future work Motif evaluation and significance measures: Motifs are typically evaluated in a subjective way by humans Objective evaluation measures that rank motifs in terms of significance are necessary Motifs as building blocks: As motifs can be used to describe the time series, they can be used as building blocks for other data mining tasks: Classification Abnormality detection Forecasting

29 Thank you for your attention! Contact: MrMotif Web site (executable, source code and datasets):

30 On similarity and multiresolution

31 On similarity
