Data Mining: Concepts and Techniques. Chap 8. Data Streams, Time Series Data, and. Sequential Patterns. Li Xiong


1 Data Mining: Concepts and Techniques Chap 8. Data Streams, Time Series Data, and Sequential Patterns Li Xiong Slides credits: Jiawei Han and Micheline Kamber and others March 27, 2008 Data Mining: Concepts and Techniques 1

2 Mining Stream, Time-Series, and Sequence Data
- Mining data streams
- Mining time-series data
- Mining sequence data

3 Mining Data Streams
- Stream data and stream data processing
- Basic methodologies for stream data processing and mining
- Stream frequent pattern analysis
- Stream classification
- Stream cluster analysis

4 Data Streams
- Data streams
  - A sequence of data in transmission
  - An ordered pair (s, Δ), where s is a sequence of tuples and Δ is the sequence of time intervals
- Characteristics
  - Continuous
  - Huge volumes, possibly infinite
  - Fast changing, requiring fast, real-time response
  - Random access is expensive, so single-scan algorithms are needed
  - Low-level or multi-dimensional in nature

5 Stream Data Applications
- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams, RFIDs
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even when saved, random access is too expensive)

6 Architecture: Stream Query Processing and Mining
[Diagram: multiple streams flow into the stream query processor of the SDMS (Stream Data Management System); the processor uses scratch space (main memory and/or disk) and returns results for the continuous query issued by the user/application.]

7 DBMS versus DSMS

  DBMS                                                          | DSMS
  Persistent relations                                          | Transient streams
  One-time queries                                              | Continuous queries
  Random access                                                 | Sequential access
  Unbounded disk store                                          | Bounded main memory
  Only current state matters                                    | Historical data is important
  No real-time services                                         | Real-time requirements
  Relatively low update rate                                    | Possibly multi-GB arrival rate
  Data at any granularity                                       | Data at fine granularity
  Assume precise data                                           | Data stale/imprecise
  Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics

Ack.: from Motwani's PODS tutorial slides

8 Mining Data Streams
- Stream data and stream data processing
- Foundations for stream data mining
- Stream frequent pattern analysis
- Stream classification
- Stream cluster analysis

9 Methodologies for Stream Data Processing
- Major challenge: keeping track of a large universe
- Methodology
  - Choosing a subset of data
    - Sampling
    - Sliding windows
    - Load shedding
  - Summarizing the data
    - Synopses (trade-off between accuracy and storage)

10 Random Sampling: Uniform Sampling
- Uniform sampling
  - Data stream of size N
  - Assume all samples are equally likely
- Example: a data stream of size 4 (also called the population) and its possible samples of size 2
Slides: R. Gemulla, W. Lehner, P. J. Haas

11 Random Sampling: Reservoir Sampling
- Reservoir sampling
  - Single-scan algorithm
  - Computes a uniform sample of M elements without knowing N
- Idea
  - Maintain a reservoir, which forms a random sample of the elements seen so far in the stream
- Algorithm
  - Add the first M elements
  - Afterwards, at item i, flip a coin:
    a) ignore the element (reject)
    b) replace a random element in the sample (accept)
  - $P(t_i \text{ is accepted}) = \frac{\text{sample size}}{\text{current population size}} = \frac{M}{i}$
Slides: R. Gemulla, W. Lehner, P. J. Haas

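12 Random Sampling: Reservoir Sampling (Example)
[Figure: example data stream with sample size M = 2. After the third item, each of the three possible samples has probability 1/3. For the fourth item, each sample branches three ways: the item is rejected with probability 2/4, or it replaces one of the two sample elements with probability 1/4 each.]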
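The reservoir sampling steps can be sketched in a few lines of Python. This is a minimal illustration rather than the slide authors' code: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made here so the example can be seeded.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of m items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # Add the first M elements unconditionally.
            reservoir.append(item)
        elif rng.random() < m / i:
            # Accept item i with probability M / i and
            # replace a randomly chosen element of the sample.
            reservoir[rng.randrange(m)] = item
        # Otherwise the item is ignored (rejected).
    return reservoir


sample = reservoir_sample(range(1000), 5, random.Random(42))
print(sample)
```

The stream is consumed in a single pass and only M items are ever held in memory, which is the property the slides emphasize.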

13 Sliding Windows
- Sliding windows
  - Make decisions based only on recent data, within a sliding window of size w
  - An element arriving at time t expires at time t + w
- Why?
  - Approximation technique for bounded memory
  - Natural in applications (emphasizes recent data)
  - Well-specified and deterministic semantics
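As a rough sketch of the expiry rule (an element arriving at time t expires at time t + w), the class below keeps only unexpired elements. The class name `SlidingWindow` and its methods are hypothetical, and the sketch assumes time stamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindow:
    """Keep only the elements that arrived within the last w time units."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (arrival_time, value) pairs, oldest first

    def add(self, t, value):
        self.items.append((t, value))
        self.expire(t)

    def expire(self, now):
        # An element that arrived at time t expires at time t + w.
        while self.items and self.items[0][0] + self.w <= now:
            self.items.popleft()

    def values(self):
        return [v for _, v in self.items]


win = SlidingWindow(3)
for t, v in enumerate("abcd"):
    win.add(t, v)
print(win.values())
```

Memory use is bounded by the number of arrivals in any span of w time units, which is what makes the window an approximation technique for bounded memory.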

14 Load Shedding
- Load shedding: discards some data so the system can keep flowing
- Techniques
  - Filters (semantic drop): choose what to shed based on QoS and selectivity
  - Drops (random drop): eliminate a random fraction of the input
- Hospital example: load shedding based on condition
[Diagram: in one plan, Patients and Doctors feed a join that outputs the doctors who can work on a patient; in the other, a condition filter is applied to Patients before the same join.]

15 Synopsis
- Synopses: summaries of the data
  - Can be used to return approximate answers
  - Trade-off between space and accuracy
- Techniques
  - Histograms
  - Wavelets
  - Sketching
- May require multiple passes

16 Mining Data Streams
- Stream data and stream data processing
- Foundations for stream data mining
- Stream frequent pattern analysis
- Stream classification
- Stream cluster analysis
- Research issues

17 Frequent Pattern Mining for Data Streams
- Issues
  - Multiple scans for training are not feasible
  - Memory/space management
  - Concept drift
- Methods
  - Approximate frequent patterns (Manku & Motwani, VLDB'02)
  - Mining evolution of frequent patterns (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
  - Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)

18 Mining Approximate Frequent Patterns
- Lossy Counting algorithm (Manku & Motwani, VLDB'02)
- Motivation
  - Mining precise frequent patterns in stream data is unrealistic
  - Approximate answers are often sufficient (e.g., trend/pattern analysis)
  - Example: a router is interested in all flows whose frequency is at least 1% (σ) of the entire traffic stream seen so far, and an error of 1/10 of σ (ε = 0.1%) is acceptable
- Major idea: approximation by tracing only frequent items
  - Advantage: guaranteed error bound
  - Disadvantage: keeps a large set of traces

19 Lossy Counting for Frequent Items
[Figure: the stream divided into Bucket 1, Bucket 2, Bucket 3]
- Input variables
  - σ: min_support
  - ε: error bound
- Fixed variable
  - w = 1/ε: window (bucket) size
- Running variables
  - N: current stream length
  - b_current = ⌈εN⌉: the current bucket
  - f_e: the real frequency count of element e
  - Set of entries (e, f, Δ): (element, approximate frequency, max error)

20 Lossy Counting for Frequent Items
[Figure: the stream divided into Bucket 1, Bucket 2, Bucket 3]
- For each new element e
  - If an entry for e exists, increment its frequency f by 1
  - Otherwise, create a new entry (e, 1, b_current - 1)
- At bucket boundaries
  - Delete entries with f + Δ <= b_current
  - (The simplified variant that keeps no Δ instead decrements the frequency of all entries by 1 and deletes those that reach 0)
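The update and pruning rules can be put together as a short Python sketch. It follows the variant that stores a max-error value per entry (called `delta` in the code) and deletes an entry at a bucket boundary when f + delta <= b_current. The function names `lossy_count` and `frequent_items` are illustrative choices, not names from the slides or the paper.

```python
import math


def lossy_count(stream, epsilon):
    """Return ({element: (f, delta)}, N) after one pass over the stream."""
    width = math.ceil(1 / epsilon)  # bucket size w = 1/epsilon
    entries = {}
    n = 0
    for e in stream:
        n += 1
        b_current = math.ceil(n / width)  # index of the current bucket
        if e in entries:
            f, delta = entries[e]
            entries[e] = (f + 1, delta)
        else:
            entries[e] = (1, b_current - 1)
        if n % width == 0:
            # Bucket boundary: drop entries with f + delta <= b_current.
            entries = {x: (f, d) for x, (f, d) in entries.items()
                       if f + d > b_current}
    return entries, n


def frequent_items(entries, n, sigma, epsilon):
    """Report elements whose stored count is at least (sigma - epsilon) * N."""
    return {e for e, (f, _) in entries.items() if f >= (sigma - epsilon) * n}


# One heavy element mixed with many elements that occur only once.
stream = ["a" if i % 5 < 3 else f"x{i}" for i in range(100)]
entries, n = lossy_count(stream, epsilon=0.1)
print(entries)
print(frequent_items(entries, n, sigma=0.5, epsilon=0.1))
```

In this run the rare elements are inserted and then pruned at the next bucket boundary, so only the heavy element survives in the summary.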

21 Illustration
[Figure: starting from an empty summary of (e, f, Δ) entries at b_current = 1, the elements of each incoming bucket are added to the summary.]

22 Approximation Guarantee
- Output: items with frequency counts exceeding (σ - ε)N
- Error analysis: how much do we undercount?
  - If the stream length seen so far is N and the bucket size is 1/ε, then the frequency count error <= number of buckets = εN
- Approximation guarantee
  - No false negatives
  - False positives have a true frequency count of at least (σ - ε)N
  - The frequency count is underestimated by at most εN

23 Lossy Counting for Frequent Itemsets
- Divide the stream into buckets, as for items
[Figure: Bucket 1, Bucket 2, Bucket 3]
- Set of entries (set, f, Δ): (itemset, approximate frequency, max error)

24 Update of Summary Data Structure
[Figure: the summary data is updated by processing 3 buckets of data held in memory, producing new summary data.]

25 Summary of Lossy Counting
- Strengths
  - A simple idea
  - Can be extended to frequent itemsets
- Weaknesses
  - The space bound is not good
  - For frequent itemsets, each record is scanned many times
  - The output is based on all previous data, but sometimes we are only interested in recent data

26 Mining Evolution of Frequent Patterns for Stream Data
- Mining evolution and dramatic changes of frequent patterns (Giannella, Han, Yan, Yu, 2003)
- Use a tilted time window frame
- Use a compressed form to store significant (approximate) frequent patterns and their time-dependent traces

27 A Tilted Time Model
- Natural tilted time frame
  - Example: the minimal unit is a quarter; then 4 quarters -> 1 hour, 24 hours -> 1 day, ...
  - [Figure: timeline divided into 12 months, 31 days, 24 hours, and 4 quarters, from oldest to most recent]
- Logarithmic tilted time frame
  - Example: the minimal unit is 1 minute; then 1, 2, 4, 8, 16, 32, ...
  - [Figure: timeline divided into windows of width 64t, 32t, 16t, 8t, 4t, 2t, t, t, from oldest to most recent]

28 Two Structures for Mining Frequent Patterns with Tilted-Time Window (1)
- FP-trees store frequent patterns
- Tilted-time major: an FP-tree for each tilted time frame

29 Frequent Pattern & Tilted-Time Window (2)
- The second data structure
  - Observation: FP-trees of different time units are similar
  - Pattern-tree major: each node is associated with a tilted-time window

30 Mining Data Streams
- Stream data and stream data processing
- Foundations for stream data mining
- Stream frequent pattern analysis
- Stream classification
- Stream cluster analysis

31 Classification for Dynamic Data Streams
- Issues
  - Multiple scans for training are not feasible
  - Concept drift
- Methods
  - VFDT (Very Fast Decision Tree) and CVFDT (Concept-adapting Very Fast Decision Tree) (Domingos, Hulten, Spencer, KDD'00/KDD'01)
  - Ensemble (Wang, Fan, Yu, Han, KDD'03)
  - K-nearest neighbors (Aggarwal, Han, Wang, Yu, KDD'04)

32 VFDT
- Basic idea
  - Consider only a small subset of training examples to find the best split attribute at a node, given a split evaluation measure G
  - How many examples are necessary at each node?
- Statistical foundation: Hoeffding bound (additive Chernoff bound)
  - r: random variable
  - R: range of r
  - n: number of independent observations
  - The true mean of r is at least $r_{avg} - \epsilon$ with probability $1 - \delta$, where
    $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
- Given the observed best attribute $X_a$ and second-best attribute $X_b$: if $\Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b) > \epsilon$, then the true difference satisfies $\Delta G \ge \Delta\bar{G} - \epsilon > 0$ with probability $1 - \delta$
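To make the bound concrete, the sketch below evaluates ε = sqrt(R² ln(1/δ) / (2n)) and applies the split test "observed gap between the two best attributes exceeds ε". The function names and the numeric values (a gap of 0.10, δ = 1e-7, range R = 1) are illustrative assumptions, not figures from the slides.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def can_split(g_best, g_second, value_range, delta, n):
    """True if the observed gap between the two best attributes exceeds epsilon."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)


for n in (100, 1000, 10000):
    eps = hoeffding_bound(1.0, 1e-7, n)
    print(n, round(eps, 4), can_split(0.30, 0.20, 1.0, 1e-7, n))
```

Because ε shrinks proportionally to 1/sqrt(n), a fixed gap between the best and second-best attribute eventually exceeds it once enough examples have reached the node.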

33 Hoeffding Tree Algorithm
- Hoeffding tree input
  - S: sequence of examples
  - X: attributes
  - G: split evaluation function (info gain, Gini index)
  - δ: one minus the desired probability of choosing the correct attribute
- Hoeffding tree algorithm
    for each example in S
        retrieve G(X_a) and G(X_b)    // the two highest G(X_i)
        compute ε
        if (G(X_a) - G(X_b) > ε)
            split on X_a
            recurse to next node
            break

34 Decision-Tree Induction with Data Streams
[Figure: a decision tree grown incrementally from a data stream. An early tree tests "Packets > 10" (yes/no) and "Protocol = http"; a later tree adds a "Bytes > 60K" test and a "Protocol = ftp" branch as more of the stream is seen.]
Slide: Gehrke

35 Hoeffding Tree: Strengths and Weaknesses
- Strengths
  - Scales better than traditional methods
    - Sublinear with sampling
    - Very small memory utilization
  - Incremental
    - Makes class predictions in parallel
    - New examples are added as they come
- Weaknesses
  - Could spend a lot of time with ties
  - Memory utilization issues with tree expansion and a large number of candidate attributes

36 VFDT (Very Fast Decision Tree)
- Modifications to Hoeffding tree
  - Near-ties broken more aggressively
  - G computed every n_min examples
  - Deactivates certain leaves to save memory
  - Poor attributes dropped
  - Initialize with a traditional learner (helps learning curve)
- Compared to a traditional decision tree
  - Similar accuracy
  - Better runtime with 1.61 million examples: 21 minutes for VFDT versus 24 hours for C4.5

37 CVFDT (Concept-adapting VFDT)
- Concept drift
  - Time-changing data streams
  - Incorporate new examples and eliminate old ones
- CVFDT
  - Sliding window approach: increment counts with each new example, decrement them for each old example
  - Grows alternate subtrees
  - When an alternate subtree becomes more accurate, it replaces the old one

38 Mining Data Streams
- Stream data and stream data processing
- Foundations for stream data mining
- Stream frequent pattern analysis
- Stream classification
- Stream cluster analysis

39 Stream Cluster Analysis
- Issues
  - Multiple scans are not feasible
  - Memory and time constraints
  - Concept drift
- Methods
  - STREAM, based on k-medians [GMMO01]
  - CluStream, based on micro-clustering and macro-clustering (Aggarwal, Han, Wang, Yu, VLDB'03)

40 STREAM [GMMO01]
- Problem: find k clusters in the stream such that the sum of distances from data points to their closest center is minimized (k-median method)
- Basic idea: divide-and-conquer
- Approximation algorithm
  1. For each set of M records, S_i, perform k-median clustering and find O(k) centers; retain only the center information (weighted by the number of points assigned to the cluster)
  2. When there are enough centers, cluster the weighted centers

41 Hierarchical Clustering Tree
[Figure: tree with data points at the bottom, level-i medians above them, and level-(i+1) medians at the top.]

42 Hierarchical Tree
- Method
  - Maintain at most m level-i medians
  - On seeing m of them, generate O(k) level-(i+1) medians, each with weight equal to the sum of the weights of the intermediate medians assigned to it
- Drawbacks
  - Low quality for evolving data streams (registers only k centers)
  - Limited functionality in discovering and exploring clusters over different portions of the stream over time

43 CluStream: A Framework for Clustering Evolving Data Streams
- Basic idea
  - Tilted time framework
  - Two stages: micro-clustering and macro-clustering
- Algorithm
  - Online/micro-clustering: periodically computes micro-clusters
    - Given multi-dimensional points $X_1, \ldots, X_k, \ldots$ at time stamps $T_1, \ldots, T_k, \ldots$
    - Cluster-feature vector (temporal extension of BIRCH): $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$
  - Offline/macro-clustering: computes macro-clusters using the k-means algorithm, based on a user-specified time horizon
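A cluster-feature vector of this kind can be maintained incrementally, since every component is a plain sum. The sketch below is a simplified illustration: the class name `MicroCluster` and its methods are hypothetical, and it covers only the bookkeeping of the five components, not the full CluStream online procedure.

```python
class MicroCluster:
    """Cluster-feature vector (CF2x, CF1x, CF2t, CF1t, n) for d-dimensional points."""

    def __init__(self, dims):
        self.cf2x = [0.0] * dims  # per-dimension sum of squared values
        self.cf1x = [0.0] * dims  # per-dimension sum of values
        self.cf2t = 0.0           # sum of squared time stamps
        self.cf1t = 0.0           # sum of time stamps
        self.n = 0                # number of points absorbed

    def insert(self, x, t):
        """Absorb point x observed at time stamp t."""
        for j, v in enumerate(x):
            self.cf1x[j] += v
            self.cf2x[j] += v * v
        self.cf1t += t
        self.cf2t += t * t
        self.n += 1

    def merge(self, other):
        """Components are additive, so merging is component-wise addition."""
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1t += other.cf1t
        self.cf2t += other.cf2t
        self.n += other.n

    def centroid(self):
        return [s / self.n for s in self.cf1x]


mc = MicroCluster(2)
mc.insert([1.0, 2.0], t=1)
mc.insert([3.0, 4.0], t=3)
print(mc.centroid(), mc.n)
```

The additive property is what allows micro-clusters stored at different time stamps to be combined or subtracted when the offline stage looks at a user-specified time horizon.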

44 Summary: Stream Data Mining
- Stream data mining: a rich and ongoing research field
  - Current research focus in the database community: DSMS system architecture, continuous query processing, supporting mechanisms
- Stream data mining
  - Powerful tools for finding general and unusual patterns
  - Effectiveness, efficiency, and scalability: lots of open problems

45 References on Stream Data Mining (1)
- C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Clustering Data Streams. VLDB'03.
- C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On-Demand Classification of Evolving Data Streams. KDD'04.
- C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Projected Clustering of High Dimensional Data Streams. VLDB'04.
- S. Babu and J. Widom. Continuous Queries over Data Streams. SIGMOD Record, Sept
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. PODS'02 (conference tutorial).
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
- P. Domingos and G. Hulten. Mining high-speed data streams. KDD'00.
- A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. SIGMOD'02.
- J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continuous data streams. SIGMOD'01.
- C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining frequent patterns in data streams at multiple time granularities. In Kargupta et al. (eds.), Next Generation Data Mining'04.

46 References on Stream Data Mining (2)
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams. FOCS'00.
- G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. KDD'01.
- S. Madden, M. Shah, J. Hellerstein, and V. Raman. Continuously Adaptive Continuous Queries over Streams. SIGMOD'02.
- G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.
- A. Metwally, D. Agrawal, and A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. ICDT'05.
- S. Muthukrishnan. Data streams: algorithms and applications. Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, 2003.
- R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge Univ. Press, 1995.
- S. Viglas and J. Naughton. Rate-Based Query Optimization for Streaming Information Sources. SIGMOD'02.
- Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. VLDB'02.
- H. Wang, W. Fan, P. S. Yu, and J. Han. Mining Concept-Drifting Data Streams using Ensemble Classifiers. KDD'03.

47 Mining Stream, Time-Series, and Sequence Data
- Mining data streams
- Mining time-series data
- Mining sequence data

48 Time-Series Data and Time-Series Analysis
- Time-series data
  - A sequence of data points measured at successive (often regular) time intervals
- Time-series data vs. data streams
  - Can be a snapshot of data streams
  - Persistent, various granularities
- Time-series analysis
  - Understand the characteristics and generating mechanism of the data: trend, cycle, seasonal, irregular
  - Make forecasts
  - Time-series analysis vs. ordinary analysis and spatial analysis
- Applications
  - Economics and finance: stock price, exchange rate
  - Industry: power consumption
  - Scientific: experiment results
  - Meteorological: precipitation

49 Time-Series Data Illustration
- A time series can be illustrated as a time-series graph, which describes a point moving with the passage of time

50 Identifying Patterns in Time-Series
- Components
  - Long-term or trend movements (T)
  - Long-term cyclic oscillations (C), e.g., business cycles
  - Short-term oscillations (S), e.g., seasonal and calendar-related
  - Irregular or random movements (I)
- Decomposition models
  - Additive models
  - Multiplicative models
[Figure: Quarterly Gross Domestic Product]

51 Additive Models
- Additive model: TS = T + C + S + I
[Figure: General Government and Other Current Transfers to Other Sectors]

52 Multiplicative Models
- Multiplicative model: TS = T * C * S * I
[Figure: Monthly Job Advertisements]

53 Trend Analysis
- Trend analysis: identify the long-term trend in the time series
- Methods
  - The freehand method
  - Function fitting: linear vs. non-linear
- Preprocessing
  - Smoothing: moving-average method (alternatives: moving mean)
  - Seasonal adjustment (deseasonalizing)
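The moving-average method replaces each run of k consecutive values by its mean, which damps short-term fluctuations so the trend is easier to see. The following is a minimal sketch; the function name `moving_average` and the sample series are illustrative choices made here.

```python
def moving_average(series, k):
    """Return the moving average of order k (one value per full window)."""
    if k <= 0 or k > len(series):
        raise ValueError("k must be between 1 and len(series)")
    window_sum = sum(series[:k])
    averages = [window_sum / k]
    for i in range(k, len(series)):
        # Slide the window: add the new value, drop the oldest one.
        window_sum += series[i] - series[i - k]
        averages.append(window_sum / k)
    return averages


print(moving_average([3, 7, 2, 0, 4, 5, 9, 7, 2], 3))
# prints [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
```

A larger k gives a smoother curve but shortens the output and reacts more slowly to genuine changes in the trend.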

54 Seasonality Analysis
- Seasonality analysis: identify seasonal patterns
  - Correlational dependency of order k between each i-th element of the series and the (i-k)-th element
- Methods
  - Visual identification
  - Autocorrelation
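The dependency of order k between element i and element i-k can be measured with the sample autocorrelation at lag k. The sketch below uses one common estimator (lagged covariance divided by the overall variance); the function name and the synthetic period-4 series are assumptions made for illustration, and the series is assumed not to be constant.

```python
def autocorrelation(series, k):
    """Sample autocorrelation of the series at lag k (0 <= k < len(series))."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[i] - mean) * (series[i - k] - mean)
                     for i in range(k, n))
    return covariance / variance


# Synthetic series that repeats every 4 steps.
series = [0, 1, 0, -1] * 8
for lag in (1, 2, 4):
    print(lag, autocorrelation(series, lag))
```

A value close to 1 at lag 4 and a strongly negative value at lag 2 point to a seasonal pattern of period 4 in this synthetic series.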

55 Other Components
- Estimation of cyclic variations
  - Long-term cyclic variations can be identified in a similar manner to seasonality
- Estimation of irregular variations
  - By adjusting the data for trend, seasonal, and cyclic variations
- With systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions of reasonable quality

56 Time-Series Forecasting
- Technical analysis (time-series analysis) vs. fundamental analysis
- Models and patterns
  - Head-and-shoulders pattern
  - Random walk model
- Methods
  - ARIMA model
  - Neural networks

57 Time-Series Forecasting: ARIMA
- ARIMA (Auto-Regressive Integrated Moving Average) model by Box and Jenkins (1976)
- ARIMA(p,d,q) model
  - Auto-regressive process AR(p): each element is made up of a random component and a linear combination of prior elements
  - Moving average process MA(q): each element is made up of a random error component and a linear combination of prior random errors
  - Integrated/differenced I(d)
- Special case: ARIMA(0,1,0) is the random walk model
- Identification, estimation, and forecasting

58 Similarity Search in Time-Series Analysis
- Two categories of similarity search
  - Whole matching: find a sequence that is similar to the query sequence
  - Subsequence matching: find all pairs of similar sequences
- Typical applications
  - Financial market
  - Market basket data analysis
  - Scientific databases
  - Medical diagnosis

59 Similarity Search
- Whole matching
  - Construct a multidimensional index based on Fourier or wavelet coefficients
  - Retrieve similar sequences
- Subsequence matching
  - Break each sequence into a set of pieces using a window of length w
  - Use a multi-piece assembly algorithm to search for longer sequence matches

60 References
- R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. FODO'93 (Foundations of Data Organization and Algorithms).
- R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. VLDB'95.
- R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. VLDB'95.
- C. Chatfield. The Analysis of Time Series: An Introduction, 3rd ed. Chapman & Hall.
- C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. SIGMOD'94.
- D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. SIGMOD'97.
- Y. Moon, K. Whang, and W. Loh. Duality Based Subsequence Matching in Time-Series Databases. ICDE'02.
- B.-K. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. ICDE'98.
- B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. ICDE'00.
- D. Shasha and Y. Zhu. High Performance Discovery in Time Series: Techniques and Case Studies. Springer, 2004.

61 Mining Stream, Time-Series, and Sequence Data
- Mining data streams
- Mining time-series data
- Mining sequence data

62 Sequence Data & Sequential Patterns
- Sequence data
  - A sequence of ordered data items, with or without a notion of time
  - Sequence data vs. time-series data vs. transaction data
- Frequent sequential pattern mining (symbolic)
- Applications of sequential pattern mining
  - Customer shopping sequences
  - Telephone calling patterns
  - Weblog click streams

63 Sequential Pattern Mining
- Agrawal and Srikant, 1995
- Given a set of sequences, find the complete set of frequent subsequences
- A sequence database

  SID | sequence
  10  | <a(abc)(ac)d(cf)>
  20  | <(ad)c(bc)(ae)>
  30  | <(ef)(ab)(df)cb>
  40  | <eg(af)cbc>

- A sequence: <(ef)(ab)(df)cb>
  - An element contains a set of unordered items
  - An l-sequence is a sequence of length l (l items)
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
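The subsequence relation and the support count from this slide can be checked mechanically. In the sketch below a sequence is a list of elements and each element is a set of items; the helper names `parse`, `is_subsequence`, and `support` are hypothetical, and the parser assumes single-character items written in the slide's notation without the angle brackets.

```python
def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a list of item sets (single-character items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if each pattern element is contained in a distinct later element, in order."""
    pos = 0
    for element in pattern:
        # Greedily match against the earliest element that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


DB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), DB[10]))  # True
print(support(parse("(ab)c"), DB))               # 2
```

With min_sup = 2, the pattern <(ab)c> is frequent because it is contained in sequences 10 and 30.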

64 Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
  - find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  - be highly efficient and scalable, involving only a small number of database scans
  - be able to incorporate various kinds of user-specific constraints

65 Sequential Pattern Mining Algorithms
- Concept introduction and an initial Apriori-like algorithm: Agrawal & Srikant, Mining sequential patterns, ICDE'95
- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

66 GSP: Generalized Sequential Pattern Mining
- GSP (Generalized Sequential Pattern) mining algorithm, proposed by Agrawal and Srikant, EDBT'96
- Outline of the method
  - Initially, every item in the DB is a candidate of length 1
  - For each level (i.e., sequences of length k):
    - scan the database to collect the support count for each candidate sequence
    - generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  - Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori

67 The Apriori Property of Sequential Patterns
- A basic property: Apriori (Agrawal & Srikant'94)
  - If a sequence S is not frequent, then none of the super-sequences of S is frequent
  - E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well
- Given support threshold min_sup = 2

  Seq. ID | Sequence
  10      | <(bd)cb(ac)>
  20      | <(bf)(ce)b(fg)>
  30      | <(ah)(bf)abf>
  40      | <(be)(ce)d>
  50      | <a(bd)bcb(ade)>

68 GSP Example: Finding Length-1 Patterns
- Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan the database once and count support for the candidates (min_sup = 2)

  Seq. ID | Sequence
  10      | <(bd)cb(ac)>
  20      | <(bf)(ce)b(fg)>
  30      | <(ah)(bf)abf>
  40      | <(be)(ce)d>
  50      | <a(bd)bcb(ade)>

  Cand | Sup
  <a>  | 3
  <b>  | 5
  <c>  | 4
  <d>  | 3
  <e>  | 3
  <f>  | 2
  <g>  | 1
  <h>  | 1

69 GSP Example: Generating Length-2 Candidates
- 2-element sequences (6 * 6 = 36 candidates)

      | <a>  | <b>  | <c>  | <d>  | <e>  | <f>
  <a> | <aa> | <ab> | <ac> | <ad> | <ae> | <af>
  <b> | <ba> | <bb> | <bc> | <bd> | <be> | <bf>
  <c> | <ca> | <cb> | <cc> | <cd> | <ce> | <cf>
  <d> | <da> | <db> | <dc> | <dd> | <de> | <df>
  <e> | <ea> | <eb> | <ec> | <ed> | <ee> | <ef>
  <f> | <fa> | <fb> | <fc> | <fd> | <fe> | <ff>

- 1-element sequences (6 * 5 / 2 = 15 candidates)

      | <a> | <b>    | <c>    | <d>    | <e>    | <f>
  <a> |     | <(ab)> | <(ac)> | <(ad)> | <(ae)> | <(af)>
  <b> |     |        | <(bc)> | <(bd)> | <(be)> | <(bf)>
  <c> |     |        |        | <(cd)> | <(ce)> | <(cf)>
  <d> |     |        |        |        | <(de)> | <(df)>
  <e> |     |        |        |        |        | <(ef)>
  <f> |     |        |        |        |        |

- With Apriori: 6*6 + 6*5/2 = 51 candidates
- Without Apriori: 8*8 + 8*7/2 = 92 candidates
- Apriori prunes 44.57% of the candidates
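The candidate counts on this slide follow from simple counting: n frequent items give n * n ordered two-element candidates plus n * (n - 1) / 2 single-element candidates with two items. A small sketch (the function name is an illustrative choice) reproduces the arithmetic:

```python
def length2_candidates(n_items):
    """Number of length-2 candidates generated from n_items length-1 sequences."""
    two_elements = n_items * n_items            # <xy>, including <xx>
    one_element = n_items * (n_items - 1) // 2  # <(xy)> with x != y, unordered
    return two_elements + one_element


with_apriori = length2_candidates(6)     # only the 6 frequent items
without_apriori = length2_candidates(8)  # all 8 items
pruned = (without_apriori - with_apriori) / without_apriori

print(with_apriori, without_apriori, f"{pruned:.2%}")
# prints 51 92 44.57%
```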

70 GSP Example
- min_sup = 2

  Seq. ID | Sequence
  10      | <(bd)cb(ac)>
  20      | <(bf)(ce)b(fg)>
  30      | <(ah)(bf)abf>
  40      | <(be)(ce)d>
  50      | <a(bd)bcb(ade)>

- 1st scan: 8 candidates, 6 length-1 patterns (candidates <a> <b> <c> <d> <e> <f> <g> <h>)
- 2nd scan: 51 candidates, 19 length-2 patterns, 10 candidates not in the DB at all (candidates include <aa> <ab> <af> <ba> <bb> <ff> <(ab)> <(ef)>)
- 3rd scan: 47 candidates, 19 length-3 patterns, 20 candidates not in the DB at all (candidates include <abb> <aab> <aba> <baa> <bab>)
- 4th scan: 8 candidates, 6 length-4 patterns (candidates include <abba> <(bd)bc>)
- 5th scan: 1 candidate, 1 length-5 pattern: <(bd)cba>
[Figure legend: some candidates cannot pass the support threshold; others are not in the DB at all.]

71 Candidate Generate-and-Test: Drawbacks
- A huge set of candidate sequences is generated, especially 2-item candidate sequences
- Multiple scans of the database are needed: the length of each candidate grows by one at each database scan
- Inefficient for mining long sequential patterns
  - A long pattern grows from short patterns
  - The number of short patterns is exponential in the length of the mined patterns

72 PrefixSpan
- PrefixSpan (Han et al. @ KDD'00)
- Divide and conquer
- Grow frequent patterns in projected databases
- No candidate sequence needs to be generated
- Major cost: constructing projected databases

73 The SPADE Algorithm
- SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki, 2001
- A vertical-format sequential pattern mining method
- A sequence database is mapped to a large set of entries of the form Item: <SID, EID>
- Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time using Apriori candidate generation

74 Ref: Mining Sequential Patterns
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI'97.
- M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning.
- J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
- J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
- X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
- J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
- H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
- J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
- J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.
