LASH: Large-Scale Sequence Mining with Hierarchies


LASH: Large-Scale Sequence Mining with Hierarchies
Kaustubh Beedkar and Rainer Gemulla
Data and Web Science Group, University of Mannheim
SIGMOD 2015, June 2nd, 2015

Syntactic Explorer (Verb to Verb Noun)

Sequence Mining
Goal: Discover subsequences as patterns in sequence data
Input: Collection of sequences of items, e.g.,
  Text collection (sequence of words)
  Customer transactions (sequence of products)
Output: Subsequences that
  occur in at least σ input sequences (frequency threshold)
  have length at most λ (length threshold)
  have gap at most γ between consecutive items (γ = 0: contiguous subsequences; γ > 0: non-contiguous subsequences)
Example:
  S1: Anna lives in Melbourne
  S2: Bob lives in the city of Berlin
  S3: Charlie likes London
  Subsequence: lives in (σ = 2, λ = 2, γ = 0)
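The three thresholds can be illustrated with a minimal Python sketch. The function names `occurs` and `support` are hypothetical and not part of LASH. The sketch assumes that the gap γ limits how many items may be skipped between two consecutive matched items.

```python
def occurs(pattern, sequence, gamma):
    """True if pattern is a subsequence of sequence in which at most
    gamma items are skipped between consecutive matched items."""
    def match(p_idx, start, end):
        if p_idx == len(pattern):
            return True
        for pos in range(start, min(end, len(sequence))):
            if sequence[pos] == pattern[p_idx]:
                # The next item may sit at pos+1 .. pos+1+gamma.
                if match(p_idx + 1, pos + 1, pos + 2 + gamma):
                    return True
        return False
    # The first pattern item may match anywhere in the sequence.
    return match(0, 0, len(sequence))


def support(pattern, db, gamma):
    """Number of input sequences that contain the pattern."""
    return sum(occurs(pattern, seq, gamma) for seq in db)


DB = [
    "Anna lives in Melbourne".split(),
    "Bob lives in the city of Berlin".split(),
    "Charlie likes London".split(),
]

SUPPORT_LIVES_IN = support(["lives", "in"], DB, gamma=0)
```

With the three example sequences, `lives in` has support 2 at γ = 0, so it qualifies for σ = 2 and λ = 2.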

Hierarchies
Items can be naturally arranged in a hierarchy, e.g.,
  Syntactic hierarchy: DET → a, an, the
  Semantic hierarchy: PERSON → Scientist (Albert Einstein, ...), Politician (Barack Obama, ...); CITY → Melbourne, ...
  Product hierarchy: Photography → DSLR Camera (Canon5D, Nikon5100), Tripod, ...

Sequence Mining with Hierarchies
Item hierarchies are specifically taken into account
Discover non-trivial patterns
Example:
  S1: Anna lives in Melbourne
  S2: Bob lives in the city of Berlin
  S3: Charlie likes London
  Semantic hierarchy: PERSON → Anna, Bob, Charlie; CITY → Melbourne, Berlin, London
  Generalized subsequence: PERSON lives in CITY (σ = 2, λ = 4, γ = 3)
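A generalized subsequence can be checked in the same spirit: a pattern item matches a sequence item if it equals that item or one of its ancestors in the hierarchy. The following Python sketch is illustrative only. The names `PARENT`, `ancestors_or_self`, `occurs_generalized`, and `support` are assumptions for this example, not LASH's API.

```python
# Child -> parent map for the semantic hierarchy of the example.
PARENT = {
    "Anna": "PERSON", "Bob": "PERSON", "Charlie": "PERSON",
    "Melbourne": "CITY", "Berlin": "CITY", "London": "CITY",
}

DB = [
    "Anna lives in Melbourne".split(),
    "Bob lives in the city of Berlin".split(),
    "Charlie likes London".split(),
]


def ancestors_or_self(item, parent):
    """The item followed by all of its ancestors, most specific first."""
    chain = [item]
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain


def occurs_generalized(pattern, sequence, gamma, parent):
    """Gap-constrained subsequence test in which a pattern item may be
    the sequence item itself or any of its ancestors."""
    def match(p_idx, start, end):
        if p_idx == len(pattern):
            return True
        for pos in range(start, min(end, len(sequence))):
            if pattern[p_idx] in ancestors_or_self(sequence[pos], parent):
                if match(p_idx + 1, pos + 1, pos + 2 + gamma):
                    return True
        return False
    return match(0, 0, len(sequence))


def support(pattern, db, gamma, parent):
    """Number of input sequences that contain the generalized pattern."""
    return sum(occurs_generalized(pattern, s, gamma, parent) for s in db)


PATTERN = ["PERSON", "lives", "in", "CITY"]
SUPPORT_GAP3 = support(PATTERN, DB, 3, PARENT)
SUPPORT_GAP0 = support(PATTERN, DB, 0, PARENT)
```

S1 matches with no gap, while S2 needs a gap of 3 to skip `the city of`. The pattern therefore reaches σ = 2 only when γ is at least 3.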

Sequence Mining with Hierarchies
Applications
  Linguistic patterns, e.g., read DET book; NNP lives in NNP
  Information extraction, e.g., PERSON lives in CITY
  Market-basket analysis, e.g., buy DSLR camera → photography book → flash
  Web-usage mining
  ...

LASH
Distributed framework for sequence mining with hierarchies
Built over MapReduce for large-scale data processing
Map (partitioning): hierarchy-aware item-based partitioning divides the data into potentially overlapping partitions
Reduce (mining): partitions are mined independently; no global post-processing
[Figure: input D with hierarchy H → partitions (D1, H1), (D2, H2), ..., (Dn, Hn) → local mining per partition → F1, F2, ..., Fn → output F]

Outline
1 Introduction
2 Partitioning
3 Local Mining
4 Evaluation
5 Conclusion

Item-based Partitioning
Items are ordered by decreasing frequency, e.g., a < b < c < ... < k
Create a partition for each frequent item, called the pivot item
Key idea: partition the output space into F_a, F_b, F_c, ..., F_k, one part per item in the order a < b < c < ... < k
  F_a: sequences that contain a but not b, ..., k
  F_b: sequences that contain b but not c, ..., k
  ...
  F_k: sequences that contain k
Rewrite D for each pivot item
  Reduces communication
  Reduces computation
  Reduces skew
[Figure: input D with hierarchy H → hierarchy-aware item-based partitioning with pivots a, b, ..., k → partitions (D1, H1), (D2, H2), ..., (Dn, Hn) → local mining per partition → F1, F2, ..., Fn → output F]

Item-based Partitioning
Example (σ = 2, γ = 3, λ = 4)
  S1: Anna lives in Melbourne
  S2: Bob lives in the city of Berlin
  S3: Charlie likes London
  Semantic hierarchy: PERSON → Anna, Bob, Charlie; CITY → Melbourne, Berlin, London
  Item order: PERSON < CITY < in < lives
Rewritten sequences per partition (sequence : count; _ is a blank that stands for an irrelevant item, and _^n abbreviates n consecutive blanks)
  Partition PERSON: PERSON : 3
  Partition CITY: PERSON _^2 CITY : 1; PERSON _ CITY : 1
  Partition in: PERSON _ in CITY : 1; PERSON _ in _^3 CITY : 1
  Partition lives: PERSON lives in CITY : 1; PERSON lives in _^3 CITY : 1
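The rewriting in this example can be approximated with a deliberately simplified Python sketch. It is not the authors' partitioning algorithm. It assumes that, for a given pivot, every item is replaced by its most specific ancestor-or-self that is frequent and not larger than the pivot in the item order, that all other items become blanks, and that leading and trailing blanks are dropped. The names `rewrite` and `partition` are hypothetical, and the item order is taken from the slide rather than computed.

```python
from collections import Counter, defaultdict

BLANK = "_"

PARENT = {
    "Anna": "PERSON", "Bob": "PERSON", "Charlie": "PERSON",
    "Melbourne": "CITY", "Berlin": "CITY", "London": "CITY",
}
ORDER = ["PERSON", "CITY", "in", "lives"]  # frequent items, most frequent first
DB = [
    "Anna lives in Melbourne".split(),
    "Bob lives in the city of Berlin".split(),
    "Charlie likes London".split(),
]


def rewrite(sequence, pivot, rank, parent):
    """Generalize or blank out every item relative to the pivot."""
    out = []
    for item in sequence:
        cur = item
        # Walk up the hierarchy until a frequent item <= pivot is found.
        while cur is not None and not (cur in rank and rank[cur] <= rank[pivot]):
            cur = parent.get(cur)
        out.append(cur if cur is not None else BLANK)
    # Leading and trailing blanks carry no information.
    while out and out[0] == BLANK:
        out.pop(0)
    while out and out[-1] == BLANK:
        out.pop()
    return out


def partition(db, order, parent):
    """Map each pivot item to a Counter of its rewritten sequences."""
    rank = {item: i for i, item in enumerate(order)}
    parts = defaultdict(Counter)
    for seq in db:
        for pivot in order:
            rewritten = rewrite(seq, pivot, rank, parent)
            if pivot in rewritten:  # the sequence is relevant for this pivot
                parts[pivot][tuple(rewritten)] += 1
    return parts


PARTS = partition(DB, ORDER, PARENT)
```

For the pivots PERSON, in, and lives this reproduces the partitions listed above. For the pivot CITY the sketch additionally keeps a rewritten form of S2 with five blanks between PERSON and CITY, because it applies no gap-based pruning. Treat it as an illustration of the idea, not of LASH's full set of rewrites.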


Local Mining
Goal: Compute the pivot sequences of each partition, i.e., F_a, F_b, F_c, ..., F_k for the item order a < b < c < ... < k

Local Mining
Traditional approach
  Use any mining algorithm (based on depth-first or breadth-first search)
  Filter out non-pivot sequences
Example: depth-first search, pivot item e
[Figure: search tree, nodes by level]
  Level 1: a b c d e
  Level 2: aa ab ac ae bd ba be cd ce da db dc ea eb ec ed ee
  Level 3: abd abe acd ace aee aea aeb aec aed dab dac dae ebd
  Level 4: aecd daec

Local Mining
Pivot sequence miner (PSM)
  Mines only pivot sequences
  Start with the pivot item
    Right expansions
    Left expansions
  Optimized search space exploration
Example: PSM search space, pivot item e
[Figure: search tree, nodes by level]
  Level 1: e
  Level 2: be ce ae ee ea ec eb ed
  Level 3: bae dae aeb ebd
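The idea of starting at the pivot and growing patterns to the right and to the left can be sketched in Python as follows. This is a simplified stand-in, not the authors' PSM: it ignores hierarchies, removes duplicate candidates with a set instead of an optimized exploration order, and recounts support by scanning the whole partition. The names `occurs` and `mine_pivot_sequences` and the toy database are assumptions made for this example.

```python
def occurs(pattern, sequence, gamma):
    """Gap-constrained subsequence test (at most gamma skipped items)."""
    def match(p_idx, start, end):
        if p_idx == len(pattern):
            return True
        for pos in range(start, min(end, len(sequence))):
            if sequence[pos] == pattern[p_idx]:
                if match(p_idx + 1, pos + 1, pos + 2 + gamma):
                    return True
        return False
    return match(0, 0, len(sequence))


def mine_pivot_sequences(db, pivot, smaller_items, sigma, gamma, lam):
    """Frequent sequences that contain the pivot and otherwise only
    items smaller than the pivot, found by left/right expansion."""
    alphabet = list(smaller_items) + [pivot]
    result = {}
    frontier = [(pivot,)]
    seen = {(pivot,)}
    while frontier:
        next_frontier = []
        for pat in frontier:
            sup = sum(occurs(pat, seq, gamma) for seq in db)
            if sup < sigma:
                continue  # dropping the first or last item never lowers support
            result[pat] = sup
            if len(pat) == lam:
                continue  # length threshold reached
            for item in alphabet:
                # Right expansion and left expansion of the current pattern.
                for cand in (pat + (item,), (item,) + pat):
                    if cand not in seen:
                        seen.add(cand)
                        next_frontier.append(cand)
        frontier = next_frontier
    return result


TOY_DB = [list("abe"), list("aeb"), list("be")]  # hypothetical partition for pivot e
CONTIGUOUS = mine_pivot_sequences(TOY_DB, "e", ["a", "b"], sigma=2, gamma=0, lam=3)
WITH_GAP = mine_pivot_sequences(TOY_DB, "e", ["a", "b"], sigma=2, gamma=1, lam=3)
```

Every candidate contains the pivot by construction, so no non-pivot sequence is ever generated or counted, which is the contrast with the traditional mine-then-filter approach.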


Overall Runtime
The New York Times Corpus
  50M sequences, 1B items of which 2.7M distinct
  Syntactic hierarchy (word → lowercase → lemma → POS tag)
  10-node Hadoop cluster
[Figure: total time in seconds (axis ticks 100, 1000, 10000) of Naive, Semi-naive, and LASH for the settings P(1000,0,3), P(100,0,3), P(100,0,5), and CLP(100,0,5), labeled as NYT hierarchy (σ,γ,λ); one annotation reads "> 12 hrs"]
LASH is multiple orders of magnitude faster

Local Mining
[Figure: mining time in seconds (axis ticks 100, 1000, 10000) of BFS, DFS, PSM, and PSM + Index for the settings LP(1000,0,5), LP(100,0,5), CLP(100,0,5), and CLP(100,0,7), labeled as NYT hierarchy (σ,γ,λ); one annotation reads "Insufficient memory"]
PSM is effective, more than 3× faster

Scalability
[Figure (a) Strong scalability: total time in seconds (axis 0 to 2500), split into Map, Shuffle, and Reduce, for 2, 4, and 8 machines]
[Figure (b) Weak scalability: total time in seconds (axis 0 to 700), split into Map, Shuffle, and Reduce, for 2 machines (25% of data), 4 machines (50%), and 8 machines (100%)]
Good strong and weak scalability


Summary and Contributions
Sequence mining with hierarchies is an important problem
  Enables mining non-trivial patterns
LASH: LArge-scale Sequence mining with Hierarchies
  Novel hierarchy-aware form of item-based partitioning
  Efficient special-purpose algorithm for mining each partition
First distributed, scalable algorithm to mine such sequences
Thank you! Questions? Comments?