CS145: INTRODUCTION TO DATA MINING

Similar documents
Data Mining: Concepts and Techniques. Chapter Mining sequence patterns in transactional databases

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

CS6220: DATA MINING TECHNIQUES

Lecture 10 Sequential Pattern Mining

Frequent Pattern Mining

Chapter 13, Sequence Data Mining

CSE 5243 INTRO. TO DATA MINING

Advance Association Analysis

Trajectory Pattern Mining. Figures and charts are from some materials downloaded from the internet.

CIS-331 Exam 2 Fall 2015 Total of 105 Points Version 1

Data Mining: Concepts and Techniques. Chap 8. Data Streams, Time Series Data, and. Sequential Patterns. Li Xiong

CIS-331 Fall 2013 Exam 1 Name: Total of 120 Points Version 1

4. Specifications and Additional Information

CIS-331 Fall 2014 Exam 1 Name: Total of 109 Points Version 1

CIS-331 Spring 2016 Exam 1 Name: Total of 109 Points Version 1

CIS-331 Exam 2 Fall 2014 Total of 105 Points. Version 1

CIS-331 Exam 2 Spring 2016 Total of 110 Points Version 1

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

CS490D: Introduction to Data Mining Prof. Walid Aref

Chapter 6: Mining Association Rules in Large Databases

Sequential Pattern Mining: Concepts and Primitives. Mining Sequence Patterns in Transactional Databases

Mining Frequent Patterns without Candidate Generation

CS145: INTRODUCTION TO DATA MINING

CIS-331 Final Exam Spring 2015 Total of 115 Points. Version 1

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

The cache is 4-way set associative, with 4-byte blocks, and 16 total lines

Data Mining for Knowledge Management. Association Rules

Part 2. Mining Patterns in Sequential Data

CIS-331 Final Exam Spring 2018 Total of 120 Points. Version 1

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

CS6220: DATA MINING TECHNIQUES

An Algorithm for Mining Large Sequences in Databases

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

CS570 Introduction to Data Mining

PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

A Comprehensive Survey on Sequential Pattern Mining

Gateway Ascii Command Protocol

c 2006 by Shengnan Cong. All rights reserved.

Data Mining: Foundation, Techniques and Applications

APPLESHARE PC UPDATE INTERNATIONAL SUPPORT IN APPLESHARE PC

Chapter 7. Advanced Frequent Pattern Mining. Meng Jiang CS412 Summer 2017: Introduction to Data Mining

TKS: Efficient Mining of Top-K Sequential Patterns

ECHO Process Instrumentation, Inc. Modbus RS485 Module. Operating Instructions. Version 1.0 June 2010

ZN-DN312XE-M Quick User Guide

Data Mining Part 3. Associations Rules

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Association Rule Mining

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Sequential PAttern Mining using A Bitmap Representation

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

An Effective Process for Finding Frequent Sequential Traversal Patterns on Varying Weight Range

One subset of FEAL, called FEAL-NX, is N round FEAL using a 128-bit key without key parity.

Triple DES and AES 192/256 Implementation Notes

C1098 JPEG Module User Manual

July Registration of a Cyrillic Character Set. Status of this Memo

CS145: INTRODUCTION TO DATA MINING

Mining more complex patterns: frequent subsequences and subgraphs. Department of Computers, Czech Technical University in Prague

Association Rules. A. Bellaachia Page: 1

DATA MINING II - 1DL460

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE

Data Mining Techniques

Mining Maximal Sequential Patterns without Candidate Maintenance

Effectiveness of Freq Pat Mining

First Data EMV Test Card Set. Version 1.30

First Data EMV Test Card Set. Version 2.00

CS 537: Introduction to Operating Systems Fall 2015: Midterm Exam #1

Association Rule Mining

Knowledge Discovery in Databases II Winter Term 2015/2016. Optional Lecture: Pattern Mining & High-D Data Mining

6.1 Combinational Circuits. George Boole ( ) Claude Shannon ( )

RS 232 PINOUTS. 1. We use RJ12 for all of our RS232 interfaces (Link-2-Modbus & Link-2-PC- Serial/RS232). The diagram below shows our pin out.

CIS-331 Final Exam Spring 2016 Total of 120 Points. Version 1

6. Specifications & Additional Information

DBK24. Isolated Digital Output Chassis. Overview

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

Chapter 7: Frequent Itemsets and Association Rules

Mining Association Rules in Large Databases

Hash Constant C Determinants leading to collisionfree

BCB 713 Module Spring 2011

CDR File Information. Comments Direct PCM

TEST DVD-VIDEO/ DVD-ROM For Checking DVD Players, DVD Recorders and DVD Drives TDH-940

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Frequent Pattern Mining

Chapter 2. Related Work

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

Discover Sequential Patterns in Incremental Database

Fast Accumulation Lattice Algorithm for Mining Sequential Patterns

A Sampling-based Framework For Parallel Data Mining

Enhanced Play Fair Cipher

Web Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web

Digital Lighting Systems, Inc.

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining

A Survey of Sequential Pattern Mining

Data warehouse and Data Mining

Transcription:

CS145: INTRODUCTION TO DATA MINING Sequence Data: Sequential Pattern Mining Instructor: Yizhou Sun yzsun@cs.ucla.edu November 27, 2017

Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering Logistic Regression; Decision Tree; KNN; SVM; NN K-means; hierarchical clustering; DBSCAN; Mixture Models Naïve Bayes for Text PLSA Prediction Frequent Pattern Mining Similarity Search Linear Regression GLM* Apriori; FP growth GSP; PrefixSpan DTW 2

Introduction Sequence Data GSP PrefixSpan Summary 3

Sequence Database A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> 4

Music: midi files Example: Music 5

Example: DNA Sequence 6

Sequence Databases & Sequential Patterns Transaction databases vs. sequence databases Frequent patterns vs. (frequent) sequential patterns Applications of sequential pattern mining Customer shopping sequences: First buy computer, then CD-ROM, and then digital camera, within 3 months. Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc. Telephone calling patterns, Weblog click streams Program execution sequence data sets DNA sequences and gene structures 7

What Is Sequential Pattern Mining? Given a set of sequences, find the complete set of frequent subsequences A sequence database SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> A sequence : < (ef) (ab) (df) c b > An element may contain a set of items. Items within an element are unordered and we list them alphabetically. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> Given support threshold min_sup =2, <(ab)c> is a sequential pattern 8

Event / element Sequence An non-empty set of items, e.g., e=(ab) Sequence An ordered list of events, e.g., s =< e 1 e 2 e l > Length of a sequence The number of instances of items in a sequence The length of < (ef) (ab) (df) c b > is 8 (Not 5!) 9

Subsequence Subsequence For two sequences α =< a 1 a 2 a n > and β =< b 1 b 2 b m >, α is called a subsequence of β if there exists integers 1 j 1 < j 2 < < j n m, such that a 1 b j1,, a n b jn Supersequence If α is a subsequence of β, β is a supersequence of α e.g., <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> 10

Sequential Pattern Support of a sequence α Number of sequences in the database that are supersequence of α Support S α α is frequent if Support S α min _support A frequent sequence is called sequential pattern l-pattern if the length of the sequence is l 11

Example SID A sequence database sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> Given support threshold min_sup =2, <(ab)c> is a sequential pattern 12

Challenges on Sequential Pattern Mining A huge number of possible sequential patterns are hidden in databases A mining algorithm should find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold be highly efficient, scalable, involving only a small number of database scans be able to incorporate various kinds of userspecific constraints 13

Sequential Pattern Mining Algorithms Concept introduction and an initial Apriori-like algorithm Agrawal & Srikant. Mining sequential patterns, ICDE 95 Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT 96) Pattern-growth methods: FreeSpan & PrefixSpan (Han et al.@kdd 00; Pei, et al.@icde 01) Vertical format-based mining: SPADE (Zaki@Machine Leanining 00) Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim@VLDB 99; Pei, Han, Wang @ CIKM 02) Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @SDM 03) 14

Introduction Sequence Data GSP PrefixSpan Summary 15

The Apriori Property of Sequential Patterns A basic property: Apriori (Agrawal & Sirkant 94) If a sequence S is not frequent Then none of the super-sequences of S is frequent E.g, <hb> is infrequent so do <hab> and <(ah)b> Seq. ID 10 20 30 40 50 Sequence <(bd)cb(ac)> <(bf)(ce)b(fg)> <(ah)(bf)abf> <(be)(ce)d> <a(bd)bcb(ade)> Given support threshold min_sup =2 16

GSP Generalized Sequential Pattern Mining GSP (Generalized Sequential Pattern) mining algorithm proposed by Agrawal and Srikant, EDBT 96 Outline of the method Initially, every item in DB is a candidate of length-1 for each level (i.e., sequences of length-k) do scan database to collect support count for each candidate sequence generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori repeat until no frequent sequence or no candidate can be found Major strength: Candidate pruning by Apriori 17

Finding Length-1 Sequential Patterns Examine GSP using an example Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h> Scan database once, count support for candidates min_sup =2 Seq. ID 10 20 30 40 50 Sequence <(bd)cb(ac)> <(bf)(ce)b(fg)> <(ah)(bf)abf> <(be)(ce)d> <a(bd)bcb(ade)> Cand Sup <a> 3 <b> 5 <c> 4 <d> 3 <e> 3 <f> 2 <g> 1 <h> 1 18

GSP: Generating Length-2 Candidates <a> <b> <c> <d> <e> <f> <a> <aa> <ab> <ac> <ad> <ae> <af> 51 length-2 Candidates <b> <ba> <bb> <bc> <bd> <be> <bf> <c> <ca> <cb> <cc> <cd> <ce> <cf> <d> <da> <db> <dc> <dd> <de> <df> <e> <ea> <eb> <ec> <ed> <ee> <ef> <f> <fa> <fb> <fc> <fd> <fe> <ff> <a> <b> <c> <d> <e> <f> <a> <(ab)> <(ac)> <(ad)> <(ae)> <(af)> <b> <(bc)> <(bd)> <(be)> <(bf)> <c> <(cd)> <(ce)> <(cf)> <d> <(de)> <(df)> <e> <(ef)> <f> Without Apriori property, 8*8+8*7/2=92 candidates Apriori prunes 44.57% candidates 19

How to Generate Candidates in From L k 1 to C k Step 1: join General? s 1 and s 2 can join, if dropping first item in s 1 is the same as dropping the last item in s 2 Examples: <(12)3> join <(2)34> = <(12)34> <(12)3> join <(2)(34)> = <(12)(34)> Step 2: pruning Check whether all length k-1 subsequences of a candidate is contained in L k 1 20

The GSP Mining Process 5 th scan: 1 cand. 1 length-5 seq. pat. <(bd)cba> Cand. cannot pass sup. threshold 4 th scan: 8 cand. 7 length-4 seq. pat. 3 rd scan: 46 cand. 20 length-3 seq. pat. 20 cand. not in DB at all 2 nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all 1 st scan: 8 cand. 6 length-1 seq. pat. <abba> <(bd)bc> <abb> <aab> <aba> <baa> <bab> <aa> <ab> <af> <ba> <bb> <ff> <(ab)> <(ef)> <a> <b> <c> <d> <e> <f> <g> <h> Cand. not in DB at all min_sup =2 Seq. ID 10 20 30 40 50 Sequence <(bd)cb(ac)> <(bf)(ce)b(fg)> <(ah)(bf)abf> <(be)(ce)d> <a(bd)bcb(ade)> 21

Candidate Generate-and-test: Drawbacks A huge set of candidate sequences generated. Especially 2-item candidate sequence. Multiple Scans of database needed. The length of each candidate grows by one at each database scan. Inefficient for mining long sequential patterns. A long pattern grow up from short patterns The number of short patterns is exponential to the length of mined patterns. 22

*The SPADE Algorithm SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001 A vertical format sequential pattern mining method A sequence database is mapped to a large set of Item: <SID, EID> Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation November 27, 2017 Data Mining: Concepts and Techniques 23

*The SPADE Algorithm Join two tables November 27, 2017 Data Mining: Concepts and Techniques 24

Bottlenecks of GSP and SPADE A huge set of candidates could be generated 1,000 frequent length-1 sequences generate s huge number of length-2 candidates! Multiple scans of database in mining Breadth-first search Mining long sequential patterns 1000 999 1000 1000 1,499,500 2 Needs an exponential number of short candidates A length-100 sequential pattern needs 10 30 candidate sequences! 100 i 1 100 i 2 100 1 10 30 November 27, 2017 Data Mining: Concepts and Techniques 25

Introduction Sequence Data GSP PrefixSpan Summary 26

Prefix and Suffix Assume a pre-specified order on items, e.g., alphabetical order <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)> Note <a(ac)> is not a prefix of <a(abc)(ac)d(cf)> Given sequence <a(abc)(ac)d(cf)> Prefix <a> <aa> <a(ab)> Suffix <(abc)(ac)d(cf)> <(_bc)(ac)d(cf)> <(_c)(ac)d(cf)> (_bc) means: the last element in the prefix together with (bc) form one element 27

Prefix-based Projection Given a sequence, α, let β be a subsequence of α, and α is be subsequence of α α is called a projection of α w.r.t. prefix β, if only and only if α has prefix β, and α is the maximum subsequence of α with prefix β Example: <ad(cf)> is a projection of <a(abc)(ac)d(cf)> w.r.t. prefix <ad> SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> 28

Projected (Suffix) Database Let α be a sequential pattern, α-projected database is the collection of suffixes of projections of sequences in the database w.r.t. prefix α Examples <a>-projected database <(abc)(ac)d(cf)> <(_d)c(bc)(ae)> <(_b)(df)cb> <(_f)cbc> <ab>-projected database SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> <(_c)(ac)d(cf)> (<a(bc)(ac)d(cf)> is the projection of <a(abc)(ac)d(cf)> w.r.t. prefix <ab>) <(_c)(ae)> (<a(bc)(ae)> is the projection of <(ad)c(bc)(ae)>) w.r.t. prefix <ab>) <c> (<abc> is the projection of <eg(af)cbc> w.r.t prefix <ab>) 29

Mining Sequential Patterns by Prefix Projections Step 1: find length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f> Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets: The ones having prefix <a>; The ones having prefix <b>; The ones having prefix <f> Step 3: mine each subset recursively via corresponding projected databases SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> 30

Finding Seq. Patterns with Prefix <a> Only need to consider projections w.r.t. <a> <a>-projected (suffix) database: <(abc)(ac)d(cf)> <(_d)c(bc)(ae)> <(_b)(df)cb> <(_f)cbc> Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> Further partition into 6 subsets Having prefix <aa>; Having prefix <af> SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> 31

Why are those 6 subsets? By scanning the <a>-projected database once, its locally frequent items are identified as a : 2, b : 4, _b : 2, c : 4, d : 2, and f : 2. Thus all the length-2 sequential patterns prefixed with <a> are found, and they are: <aa> : 2, <ab> : 4, <(ab)> : 2, <ac> : 4, <ad> : 2, and <af > : 2. 32

Completeness of PrefixSpan SDB SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> Length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f> Having prefix <a> <a>-projected database <(abc)(ac)d(cf)> <(_d)c(bc)(ae)> <(_b)(df)cb> <(_f)cbc> Having prefix <b> Having prefix <c>,, <f> <b>-projected database Length-2 sequential patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> Having prefix <aa> Having prefix <af> <aa>-proj. db <af>-proj. db 33

Examples <aa>-projected database <(_bc)(ac)d(cf)> <(_e)> <ab>-projected database <(_c)(ac)d(cf)> <(_c)(ae)> <c> <(ab)>-projected database <(_c)(ac)d(cf)> <(df)cb> <a>-projected database: <(abc)(ac)d(cf)> <(_d)c(bc)(ae)> <(_b)(df)cb> <(_f)cbc> Reference: http://hanj.cs.illinois.edu/pdf/tkde04_spgjn.pdf 34

Efficiency of PrefixSpan No candidate sequence needs to be generated Projected databases keep shrinking Major cost of PrefixSpan: Constructing projected databases Can be improved by pseudo-projections 35

Speed-up by Pseudo-projection Major cost of PrefixSpan: projection Postfixes of sequences often appear repeatedly in recursive projected databases When (projected) database can be held in main memory, use pointers to form projections s=<a(abc)(ac)d(cf)> <a> Pointer to the sequence Offset of the postfix s <a>: (, 2) s <ab>: (, 4) <(abc)(ac)d(cf)> <ab> <(_c)(ac)d(cf)> 36

Pseudo-Projection vs. Physical Projection Pseudo-projection avoids physically copying postfixes Efficient in running time and space when database can be held in main memory However, it is not efficient when database cannot fit in main memory Disk-based random accessing is very costly Suggested Approach: Integration of physical and pseudo-projection Swapping to pseudo-projection when the data set fits in memory 37

Performance on Data Set C10T8S8I8 Data Mining: Concepts and Techniques 38

Performance on Data Set Gazelle Data Mining: Concepts and Techniques 39

Effect of Pseudo-Projection Data Mining: Concepts and Techniques 40

Introduction Sequence Data GSP PrefixSpan Summary 41

Summary Sequential Pattern Mining GSP, PrefixSpan 42